FREQUENTIST OPTIMALITY OF BAYESIAN WAVELET
SHRINKAGE RULES FOR GAUSSIAN AND
NON-GAUSSIAN NOISE

By Marianna Pensky 

University of Central Florida 

The present paper investigates theoretical performance of various 
Bayesian wavelet shrinkage rules in a nonparametric regression model 
with i.i.d. errors which are not necessarily normally distributed. The 
main purpose is comparison of various Bayesian models in terms of 
their frequentist asymptotic optimality in Sobolev and Besov spaces. 

We establish a relationship between hyperparameters, verify that 
the majority of Bayesian models studied so far achieve theoretical 
optimality, state which Bayesian models cannot achieve optimal con- 
vergence rate and explain why it happens. 

1. Introduction. Bayesian techniques for shrinking wavelet coefficients 
have become very popular in the last few years. Starting with the paper by 
Clyde, Parmigiani and Vidakovic [9], researchers turned to Bayesian meth- 
ods in wavelet analysis (see, e.g., [1, 2, 3, 4, 6, 7, 8, 10, 17, 19, 24, 26]). 

The majority of the above papers were devoted to the nonparametric 
wavelet regression model 

(1.1) $Y_i = f(i/n) + Z_i, \quad i = 1, \dots, n,$

where $f$ is unknown and belongs to some function space $\mathcal{F}$ and $Z_i$ are
i.i.d. random variables. In most cases, the errors $Z_i$ were assumed to be
normally distributed with zero mean and unknown variance $\sigma^2$. Some papers
in engineering fields modeled errors as double-exponential (see, e.g., [20] or
[23]); however, this approach was not very popular in statistical applications.
The only paper known to the author which deals with nonnormal errors in
a Bayesian framework is the one by Clyde and George [8], where the authors
conduct comprehensive simulations with $Z_i$ having various distributions. In
what follows we just assume that the $Z_i$'s are i.i.d., symmetric and have finite
fourth moment $EZ_i^4 < \infty$.
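
The sampling model (1.1) can be simulated along the following lines. This is only a minimal sketch: the function name `simulate_regression`, the choice of the double-exponential (Laplace) law as the non-Gaussian example and the seed handling are illustrative assumptions, not part of the paper. The Laplace law is symmetric with all moments finite, so it satisfies the fourth-moment assumption above.

```python
import math
import random

def simulate_regression(f, n, noise="laplace", scale=1.0, seed=0):
    """Draw Y_i = f(i/n) + Z_i, i = 1..n, with i.i.d. symmetric errors Z_i.

    noise="normal"  -> Z_i ~ N(0, scale^2)
    noise="laplace" -> Z_i double-exponential with scale parameter `scale`
                       (hypothetical choice of a non-Gaussian error law).
    """
    rng = random.Random(seed)
    ys = []
    for i in range(1, n + 1):
        if noise == "normal":
            z = rng.gauss(0.0, scale)
        else:
            # Difference of two unit exponentials is a standard Laplace variable.
            z = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        ys.append(f(i / n) + z)
    return ys
```

For instance, `simulate_regression(math.sin, 1024)` returns 1024 noisy observations of the sine function on the grid $i/n$.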

Now let us discuss the error model for wavelet coefficients which is used 
in Bayesian inference. If the $Z_i$'s in (1.1) are i.i.d. normal, the errors in
wavelet coefficients are also i.i.d. normal, and most authors adopted this 
"real" model for Bayesian inference. However, some of the authors (Clyde 
and George [8] and Vidakovic [24]) were more adventurous and considered 
other distributions for error models, namely, distributions that result from 
a mixture of normal distributions over the scale parameter. Various choices 
of priors were suggested, and the posterior mean or median was used as 
an estimator of wavelet coefficients. The procedures were then justified by 
extensive simulations. Thus, a variety of algorithms have been suggested 
for applications; all of them have demonstrated good computational perfor- 
mance, but until the present time no attempts have been made to examine 
frequentist properties of these techniques or to compare various algorithms 
with each other. 

To the best of the author's knowledge, three manuscripts have been pro- 
duced recently with the objective to assess frequentist properties of Bayesian 
wavelet shrinkage and thresholding. The first one, by Abramovich, Amato 
and Angelini [1], explores optimality of Bayesian wavelet estimators in the 
case of normal errors in (1.1) and a combination of a point mass at zero 
and a normal density for the prior. The authors consider three choices of 
Bayesian wavelet estimators: the mean, the median and the mean thresh- 
olded on the basis of a Bayes factor rule [24], and then study optimality 
of the resulting wavelet regression estimators in Besov spaces. Johnstone 
and Silverman [19] go much further than Abramovich, Amato and Angelini 
[1], investigating theoretical and computational properties of the empirical 
Bayes thresholding rules. They also assume the errors to be normally dis- 
tributed and choose priors to be a mixture of an atom probability at zero and 
a heavy-tailed density. The mixing weight, or sparsity parameter, for each 
level of the transform is chosen by maximizing marginal likelihood. The esti- 
mators can be based on a posterior mean or a posterior median, and they are 
optimal in Besov spaces. Finally, Autin, Picard and Rivoirard [5] compare a 
wide variety of Bayes shrinkage and thresholding rules based on both normal 
and heavy-tailed priors using a novel maxiset approach. However, although 
Autin, Picard and Rivoirard [5] and Johnstone and Silverman [19] develop 
breakthrough theory concerning regression with normally distributed errors, 
they do not discuss the case of a non-Gaussian distribution either for the 
actual error in (1.1) or as the error model in Bayesian inference. 

The present paper is in a sense complementary to the three papers men- 
tioned above as well as to Clyde and George [8]. Our interest was sparked 
by the excellent paper by Clyde and George [8], who present an amazing
volume of simulations for various error distributions and various Bayesian
models. All of their Bayesian models produce estimators as efficient as (or
better than) the ones based on traditional thresholding rules, whether
the errors in (1.1) are normal or not. Yet, no theoretical study of these
estimators is conducted; hence, some of the simulation results are stated but
not explained.

The present paper examines various Bayesian models in terms of their 
frequentist asymptotic optimality without a predominant assumption that 
errors in (1.1) are normally distributed. Bayesian estimators for the wavelet 
coefficients are constructed using posterior means for a variety of error mod- 
els and priors, and these models are compared in terms of their frequentist 
optimality. Although theoretical results of the paper are asymptotic, they al- 
low one to explain some features of the simulations conducted by Clyde and 
George [8], namely, differences in the performances of "normal error-normal 
prior" model and the models with heavy-tailed error and prior distributions. 

The material presented in the paper is interesting from several points of 
view. First, the relationship between hyperparameters is established, and it 
is verified that the majority of Bayesian models studied so far in the lit- 
erature have indeed not only excellent computational properties but also 
theoretical optimality. Second, it is shown that some Bayesian models cannot
achieve the optimal convergence rate, and it is explained why this happens.
Third, the prior distributions on wavelet coefficients are compared to see 
which of them agree with the assumption that the regression function be- 
longs to a particular Besov space. The latter leads to a more comprehensive 
examination of the mixing weights. For example, we discover that the weights 
chosen in [1, 3] lead to convergence rates which differ from optimal (e.g., by 
the logarithmic factor in the case of normal errors and normal priors). 

The rest of the paper is organized as follows. In Section 2 we introduce 
Bayesian models for wavelet coefficients. We use some "arbitrary" error
model $\eta_j(\cdot)$ and a mixture of a point mass at zero and a density $\xi(\cdot)$ for
a prior, keeping in mind that the actual distribution of wavelet coefficients
is unknown at fine resolution levels and is asymptotically normal at coarse
resolution levels according to the central limit theorem. In Section 3 we
discuss assumptions on $\eta_j(\cdot)$ and $\xi(\cdot)$ and provide assertions about asymptotic
optimality of regression estimators in Besov spaces for various choices of
$\eta_j(\cdot)$ and $\xi(\cdot)$. In Section 4 we present a comparison between various Bayesian
models in tabular form and discuss the theoretical results of Section 3. We also
explain the results of simulations conducted in [8]. Section 5 contains some
auxiliary statements as well as proofs of the statements in Section 3.

2. The Bayesian model. Consider a standard nonparametric regression
model (1.1) and assume that $f$ is square integrable and $Z_i$ are i.i.d. random
variables with zero mean and finite variance. Application of the discrete
wavelet transform (DWT) based on the wavelet $\psi$ to (1.1) yields

(2.1) $Y_{jk} = w_{jk} + \varepsilon_{jk}, \quad j = 0, \dots, J-1, \; k = 0, \dots, 2^j - 1, \; J = \log_2 n,$

where $\varepsilon_{jk}$ are uncorrelated random variables due to the unitary property of
the DWT. We assume that the wavelet function $\psi$ has finite support and
$s > r$ vanishing moments. If one is interested in the error of the resulting
estimator of $f$ only at points $i/n$, one can just use periodized orthogonal
wavelets. However, since in this paper we are interested in the $L^2$-norm based
risk as the measure of the quality of the estimation procedure, some kind
of boundary correction is necessary. For this reason, we shall use boundary
coiflets introduced in [18, 19], which we describe in detail in Section 3.1.
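
The unitary property invoked after (2.1) can be illustrated with the simplest orthonormal wavelet transform. The sketch below uses the Haar wavelet purely as a stand-in (an assumption for illustration; the paper itself works with boundary coiflets), and the function name `haar_dwt` is hypothetical. Because the transform is orthonormal, it preserves the sum of squares, which is why uncorrelated errors with a common variance remain uncorrelated with the same variance.

```python
import math

def haar_dwt(y):
    """Orthonormal Haar transform of a list whose length is a power of two.

    Returns (c, details), where c is the single coarsest scaling coefficient
    and details[j] holds the 2**j wavelet coefficients at resolution level j.
    """
    n = len(y)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    approx = list(y)
    details = []  # collected finest level first, reversed before returning
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        half = len(approx) // 2
        smooth = [(approx[2 * k] + approx[2 * k + 1]) / root2 for k in range(half)]
        detail = [(approx[2 * k] - approx[2 * k + 1]) / root2 for k in range(half)]
        details.append(detail)
        approx = smooth
    details.reverse()
    return approx[0], details
```

With $n = 2^J$ observations the output has levels $j = 0, \dots, J-1$ with $2^j$ coefficients each, matching the index ranges in (2.1).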

Denote the scaling and the wavelet coefficients of the original function $f$
by $\theta_k$ and $\theta_{jk}$, respectively, so that $f$ can be reconstructed as

(2.2) $f(x) = \sum_{k=0}^{2^L-1} \theta_k 2^{L/2} \varphi(2^L x - k) + \sum_{j=L}^{\infty} \sum_{k=0}^{2^j-1} \theta_{jk} \psi_{jk}(x)$

with $\theta_k = \int_{-\infty}^{\infty} 2^{L/2} \varphi(2^L x - k) f(x)\,dx$ and $\theta_{jk} = \int_{-\infty}^{\infty} \psi_{jk}(x) f(x)\,dx$. Here
$\varphi(x)$ is a scaling function corresponding to the mother wavelet $\psi(x)$,
$\psi_{jk}(x) = 2^{j/2} \psi(2^j x - k)$ and the value of $L$ will be defined later. Denote
$\tilde\theta_{jk} = w_{jk}/\sqrt{n}$ and recall that $\tilde\theta_{jk} \approx \theta_{jk}$ (see, e.g., [25]). Later we shall
provide a more detailed treatment of the relation between $\theta_{jk}$ and $\tilde\theta_{jk}$.

We shall use a Bayesian technique to construct estimators $\hat\theta_{jk}$ of $\theta_{jk}$ based
on $Y_{jk}$. Since wavelet representations of a vast majority of functions contain
only a few nonnegligible coefficients in their expansions, we place the following
prior on the population discrete wavelet coefficient $w_{jk}$:

(2.3) $w_{jk} \sim \pi_{jn} \tau_{jn} \xi(\tau_{jn} x) + (1 - \pi_{jn}) \delta(0),$

where $0 < \pi_{jn} < 1$ if $0 \le j \le J-1$ and $\pi_{jn} = 0$ if $j \ge J$, $\delta(0)$ is a point mass
at zero and $w_{jk}$ are independent. The p.d.f. $\xi(x)$ is even and unimodal. Note
that the majority of priors used previously for Bayesian wavelet inference
follow the model (2.3). The factors $\pi_{jn}$ are the prior probabilities that a
wavelet coefficient at level $j$ does not vanish. In what follows, however, we
shall impose all conditions on the odds

(2.4) $\beta_{jn} = (1 - \pi_{jn})/\pi_{jn}.$

Note that we allow dependence of $\pi_{jn}$ not only on the resolution level $j$
but also on $n$. This is most natural since the proportion of coefficients we
intend to keep depends not only on the function $f$ but also on the
amount of data available: the larger $n$ is, the more reliable the estimators of
coefficients are, the smaller the coefficients we can keep, and the larger the value
of $\pi_{jn}$ for a particular resolution level $j$.
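
As a small illustration of (2.3) and (2.4), the sketch below converts a prior probability into odds and draws from the point-mass mixture. The names `prior_odds` and `sample_prior`, and the use of a standard double-exponential density for $\xi$, are assumptions made only for this example.

```python
import random

def prior_odds(pi):
    """Odds beta = (1 - pi) / pi as in (2.4); requires 0 < pi <= 1."""
    if not 0.0 < pi <= 1.0:
        raise ValueError("pi must lie in (0, 1]")
    return (1.0 - pi) / pi

def sample_prior(pi, tau, rng):
    """Draw w from pi * tau * xi(tau x) + (1 - pi) * delta(0).

    xi is taken to be the standard double-exponential density here
    (an illustrative choice; any even unimodal density could be used).
    """
    if rng.random() >= pi:
        return 0.0  # the point mass at zero
    return (rng.expovariate(1.0) - rng.expovariate(1.0)) / tau
```

Small $\pi_{jn}$ corresponds to large odds $\beta_{jn}$, that is, to a prior belief that most coefficients at the level vanish.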

Now let us discuss the distribution of errors. It follows from (1.1) that
$\varepsilon_{jk} \approx n^{-1/2} 2^{j/2} \sum_{i=1}^{n} \psi(2^j i/n - k) Z_i$. Since $Z_i$ are i.i.d. with $EZ_i^4 < \infty$, the
sequence $2^{j/2} n^{-1/2} \psi(2^j i/n - k) Z_i$ satisfies the Lyapunov condition with $\delta = 2$
provided $2^j/n \to 0$. Hence, if the resolution level is reasonably small ($j \le J_0$
where $J - J_0 \to \infty$ as $n \to \infty$), the errors $\varepsilon_{jk}$ are asymptotically normally
distributed $N(0, \sigma^2)$ and, thus, asymptotically independent, while at large
resolution levels they are uncorrelated and have some unknown p.d.f. $\mu(x)$
which is symmetric and has a finite fourth moment. To incorporate both of
these features into the distribution of $\varepsilon_{jk}$ we write it as

(2.5) $\varepsilon_{jk} \sim (1 - \lambda_j)(\sigma\sqrt{2\pi})^{-1} \exp(-x^2/(2\sigma^2)) + \lambda_j \mu(x).$

Here $\sigma^2 = \int_{-\infty}^{\infty} x^2 \mu(x)\,dx = EZ_i^2$, $0 \le \lambda_j \le 1$, and $\lambda_j$ are close to 1 at large
resolution levels and equal to zero at small resolution levels, namely, $\lambda_j = 1$
if $j = J - 1$ and $\lambda_j = 0$ if $j \le J_0$. For a more detailed treatment of asymptotic
normality an interested reader can consult Neumann and von Sachs [21].

The difficulty of using (2.5) in Bayesian inference is that both $\mu(x)$ and $\lambda_j$
are unknown. For this reason, we shall choose the most general distribution
for errors, namely,

(2.6) $\varepsilon_{jk} \sim \eta_j(x),$

where $\eta_j$ are level-dependent symmetric densities. As we shall show later,
one does not need the knowledge of the true distribution of $\varepsilon_{jk}$ and can
obtain the optimal estimator of $f$ with a variety of error distributions $\eta_j$.
Moreover, $\eta_j$ does not even need to be level dependent.

The model (2.6) generalizes the setup of Clyde and George [8], who con- 
sidered error distributions other than normal in (1.1), namely, the distri- 
butions that can be obtained from the normal distribution by mixing over 
the scale parameter. It also follows the remarkable idea of Vidakovic [24], 
who assumed that the errors are normally distributed but used a double- 
exponential distribution for errors which resulted from the hierarchical Bayes 
approach, hence, conducting Bayesian inference with the error distribution 
different from the true one. 

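We shall conduct Bayesian inference for each wavelet coefficient separately.
Denote

(2.7) $d_{jk} = Y_{jk}/\sqrt{n}, \quad \nu_j = \sqrt{n}\,\tau_{jn}.$

Taking into account the relation between $w_{jk}$ and $\theta_{jk}$ and (2.3)-(2.7), we
derive that the posterior p.d.f. of $\theta_{jk}$ given $d_{jk}$ is of the form

$p(\theta_{jk} \mid d_{jk}) = \frac{\sqrt{n}\,\eta_j(\sqrt{n}(\theta_{jk} - d_{jk}))\,\nu_j \xi(\nu_j \theta_{jk}) + \beta_{jn} \sqrt{n}\,\eta_j(\sqrt{n}\,d_{jk})\,\delta(0)}{\int_{-\infty}^{\infty} \sqrt{n}\,\eta_j(\sqrt{n}(x - d_{jk}))\,\nu_j \xi(\nu_j x)\,dx + \beta_{jn} \sqrt{n}\,\eta_j(\sqrt{n}\,d_{jk})}.$

Choosing the posterior mean to be an estimator, we arrive at the following
estimator of $\theta_{jk}$:

(2.8) $\hat\theta_{jk} = \frac{I_{1j}(d_{jk})}{I_{0j}(d_{jk}) + \beta_{jn} \sqrt{n}\,\eta_j(\sqrt{n}\,d_{jk})}, \quad j \le J - 1,$

where

(2.9) $I_{ij}(d) = \int_{-\infty}^{\infty} x^i \sqrt{n}\,\eta_j(\sqrt{n}(x - d))\,\nu_j \xi(\nu_j x)\,dx, \quad i = 0, 1,$

and $\hat\theta_{jk} = 0$ as $j \ge J$, so that the estimator of $f$ is of the form

(2.10) $\hat f(x) = \sum_{k=0}^{2^L-1} \hat\theta_k 2^{L/2} \varphi(2^L x - k) + \sum_{j=L}^{J-1} \sum_{k=0}^{2^j-1} \hat\theta_{jk} \psi_{jk}(x).$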
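
The posterior-mean rule can be evaluated numerically for any pair of densities. The sketch below is one possible way to do so, assuming a ratio of the integrals $I_1$ and $I_0$ as in (2.8)-(2.9) and a plain trapezoid rule; the helper names (`integral_I`, `posterior_mean`), the integration window and the step count are illustrative assumptions and would need tuning when $\nu_j$ and $\sqrt{n}$ differ by orders of magnitude.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def laplace_pdf(x):
    """Standard double-exponential density."""
    return 0.5 * math.exp(-abs(x))

def integral_I(i, d, n, nu, eta, xi, half_width=12.0, steps=4000):
    """Trapezoid approximation of I_i(d) = int x^i sqrt(n) eta(sqrt(n)(x-d)) nu xi(nu x) dx."""
    root_n = math.sqrt(n)
    # Window covering both the likelihood bump (around d) and the prior bump (around 0).
    pad = half_width * (1.0 / root_n + 1.0 / nu)
    lo, hi = min(0.0, d) - pad, max(0.0, d) + pad
    h = (hi - lo) / steps
    total = 0.0
    for m in range(steps + 1):
        x = lo + m * h
        weight = 0.5 if m in (0, steps) else 1.0
        total += weight * (x ** i) * root_n * eta(root_n * (x - d)) * nu * xi(nu * x)
    return total * h

def posterior_mean(d, n, nu, beta, eta, xi):
    """Shrinkage rule I_1(d) / (I_0(d) + beta * sqrt(n) * eta(sqrt(n) d))."""
    root_n = math.sqrt(n)
    i0 = integral_I(0, d, n, nu, eta, xi)
    i1 = integral_I(1, d, n, nu, eta, xi)
    return i1 / (i0 + beta * root_n * eta(root_n * d))
```

With `eta = xi = normal_pdf` and `beta = 0` the rule reduces to the familiar linear shrinkage $d\,n/(n + \nu^2)$, which gives a convenient sanity check; increasing `beta` pulls the estimate further toward zero.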

The objective of the present paper is to formulate conditions under which
the estimator of $f$ based on coefficients $\hat\theta_{jk}$ is optimal and explore the cases
where this optimality cannot be achieved. For any possible estimator $\tilde f$ of $f$
based on $n$ observations $X_1, \dots, X_n$, define the mean integrated square error
(MISE) over a function space $\mathcal{F}$ as

(2.11) $R_n(\mathcal{F}, \tilde f) = \sup_{f \in \mathcal{F}} E\|\tilde f - f\|^2_{L^2[0,1]}.$

Donoho and Johnstone [14] showed that when the errors $Z_i$ in (1.1) are
normally distributed and $f$ belongs to a ball $B^r_{pq}(A)$ in the Besov space
$B^r_{pq}[0, 1]$, then there exist constants $C_1$ and $C_2$ independent of $n$ such that

(2.12) $C_1 n^{-2r/(2r+1)} \le \inf_{\tilde f} R_n(B^r_{pq}(A), \tilde f) \le C_2 n^{-2r/(2r+1)}$

provided $r > \max(0, 1/p - 1/2)$ and $p, q \ge 1$. Since the Sobolev space $H^r =
B^r_{22}$, the same rates of convergence hold for $H^r(A)$. Note that since the
normal distribution for errors is a particular case of (2.5), the lower bounds
in our situation cannot be smaller than (2.12). On the other hand, since
for a majority of resolution levels ($j \le J_0$) the errors follow the normal
model, we can expect to achieve convergence rates (2.12) as $n \to \infty$ for some
choices of error models (2.6) and priors (2.3).

In what follows we shall compare various Bayesian models in terms of
their ability to achieve optimal convergence rates (2.12) as $n \to \infty$.



3. Asymptotic optimality.

3.1. Assumptions. In what follows we shall formulate conditions on the
p.d.f.'s $\xi(\cdot)$ and $\eta_j(\cdot)$ and parameters $\nu_j$, $\beta_{jn}$ and $J_0$. We warn the readers
that not all of the conditions listed below will necessarily appear in every
statement later.

Let $\varphi$ and $\psi$ be boundary coiflets introduced in [18, 19], possessing $s > r$
vanishing moments and based on orthonormal coiflets supported in $[-S +
1, S]$, $s < S$. Assume that $p \ge 1$, $r > \max(1/2, 1/p)$ and

(3.1) $r > r_p = 0.5\left[(1/p - 1/2) + \sqrt{(1/p - 1/2)^2 + 2(1/p - 1/2)}\right] I(p < 2); \quad L \ge \log_2(6S - 6).$

Note that $r_p = 0$ when $p \ge 2$ and that $r_p \le (1 + \sqrt{5})/4$ for any $p \ge 1$.
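
The threshold $r_p$ in (3.1) is easy to tabulate. The sketch below (the function name `r_p` is an assumption for this example) evaluates it and confirms the two remarks above: it vanishes for $p \ge 2$ and reaches its largest value $(1 + \sqrt{5})/4$ at $p = 1$.

```python
import math

def r_p(p):
    """Smoothness threshold r_p from (3.1); zero when p >= 2."""
    if p >= 2:
        return 0.0
    a = 1.0 / p - 0.5
    return 0.5 * (a + math.sqrt(a * a + 2.0 * a))
```

For example, `r_p(1.0)` is about 0.809 and `r_p(1.5)` is about 0.381, so the restriction $r > r_p$ is weaker for $p$ closer to 2.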

Let $\xi(x)$ and $\eta_j(x)$ be three times continuously differentiable everywhere
except possibly $x = 0$, have finite fourth moments and satisfy the conditions

(3.2) (A1) $|\xi^{(k)}(x)/\xi(x)| \le C_\xi (1 + |x|^{\lambda_\xi})^k, \quad k = 1, 2, 3, \; \lambda_\xi \ge 0,$

(3.3) (A2) $|\eta_j^{(k)}(x)/\eta_j(x)| \le C_\eta (1 + |x|^{\lambda_\eta})^k, \quad k = 1, 2, 3, \; \lambda_\eta \ge 0,$

(3.4) (A3) $|\eta_j(x)/\xi(x)| \le C_{\xi\eta},$

(3.5) (A4) $x_1^{2+\delta} \eta_j(x_1) \le x_2^{2+\delta} \eta_j(x_2), \quad x_1 > x_2 > C_\delta > 0, \; \delta > 0.$

Condition (A4) just means that the functions $|x|^{2+\delta} \eta_j(x)$ are nonincreasing
for sufficiently large $x$. Note that the constants $\lambda_\eta$, $C_\eta$, $C_{\xi\eta}$, $\delta$ and $C_\delta$ are
assumed to be independent of $j$, which requires some kind of uniformity
of the p.d.f.'s $\eta_j$. The consequence of these restrictions is that asymptotic
expansions of integrals $I_{ij}$ defined in (2.9) are valid with absolute constants
independent of $j$, so in what follows we shall suppress the index $j$ in $I_{ij}$, $i =
0, 1$, unless this leads to confusion.

When conditions (A1) and (A2) hold with $\lambda_\xi = 0$ and $\lambda_\eta = 0$, following
Johnstone and Silverman [17, 19], we shall say that $\eta_j$ and $\xi$ are heavy-tailed
probability densities. The most common examples of the latter are
p.d.f.'s of the double-exponential or Student $t$ distributions. In this situation,
asymptotic expansions

(3.6) $I_0(d) \sim \nu_j \xi(\nu_j d), \quad |I_1(d)/I_0(d) - d| = O(\nu_j/n) \quad$ if $\nu_j/\sqrt{n} \to 0,$

(3.7) $I_0(d) \sim \sqrt{n}\,\eta(\sqrt{n}\,d), \quad |I_1(d)/I_0(d)| = O(\sqrt{n}/\nu_j^2) \quad$ if $\nu_j/\sqrt{n} \to \infty,$

which are proved later in Lemma 2, are valid for any $d$ as long as the relation
between $\nu_j$ and $n$ holds. If $\lambda_\xi$ and $\lambda_\eta$ are positive, then the expansions
(3.6) and (3.7) can be used under some restrictions on $d$ only. There is one
important case though, when we can obtain asymptotic expansions for the
integrals for any value of $d$: if $\xi$ and $\eta_j$ are normal p.d.f.'s and the variances
of $\eta_j$ are bounded from above and from below by a positive value, then

(3.8) $|I_1(d)/I_0(d) - d| = O(|d| \nu_j^2/n),$

(3.9) $I_0(d) \sim \nu_j \xi(\nu_j d) \quad$ if $\nu_j/\sqrt{n} \to 0,$

(3.10) $|I_1(d)/I_0(d)| = O(n|d|/\nu_j^2) \quad$ if $\nu_j/\sqrt{n} \to \infty.$

We denote

(3.11) $j_0 = (2r + 1)^{-1} \log_2 n,$

and assume that the parameter $\nu_j$ is of the form

(3.12) $\nu_j = C_1 2^{mj}$ where $m = r + 1/2$ if $p \ge 2$.

The expression for $\nu_j$ when $1 \le p < 2$ will be presented later [see (3.28)].
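
In the normal-normal case the ratio $I_1(d)/I_0(d)$ has a closed form, which makes the orders of magnitude quoted for that case easy to check. The sketch below assumes $\xi$ is the standard normal density and $\eta_j$ is the $N(0, \sigma^2)$ density, so that the prior on $\theta$ is $N(0, \nu^{-2})$ and $d$ given $\theta$ is $N(\theta, \sigma^2/n)$; the function names are hypothetical.

```python
def normal_normal_ratio(d, n, nu, sigma=1.0):
    """I_1(d) / I_0(d) for normal xi and normal eta: the conjugate posterior mean."""
    return d * n / (n + sigma ** 2 * nu ** 2)

def deviation_from_d(d, n, nu, sigma=1.0):
    """|I_1/I_0 - d|, which equals |d| sigma^2 nu^2 / (n + sigma^2 nu^2)."""
    return abs(normal_normal_ratio(d, n, nu, sigma) - d)
```

When $\nu_j/\sqrt{n}$ is small the deviation is at most $|d| \sigma^2 \nu_j^2 / n$, and when $\nu_j/\sqrt{n}$ is large the ratio itself is at most $n|d|/(\sigma^2 \nu_j^2)$; both bounds hold for every $d$, in line with the two regimes described above.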

Remark 1. Assumptions about $\nu_j$ can be translated into the ones on
$\tau_{jn}$ using relation (2.7), namely, $\tau_{jn} = C_1 2^{mj}/\sqrt{n}$. This coincides with the
choice of $\tau_{jn}$ for the normal-normal model in [1, 3].

Remark 2. The assumption that $\varphi(x)$ and $\psi(x)$ are boundary coiflets as
well as conditions (3.1) are introduced for the sake of obtaining convergence
rates for an $L^2$-norm-based risk function. All statements of the paper will be
true for $L = 0$ and an arbitrary $s$-regular scaling function $\varphi(x)$ and wavelet
$\psi(x)$ with $s > \max(r, r + 1/2 - 1/p)$ if one replaces (2.11) by

$R_n(\mathcal{F}, \hat f) = \sup_{f \in \mathcal{F}} n^{-1} \sum_{i=1}^{n} E[\hat f(i/n) - f(i/n)]^2.$

Remark 3. Conditions on the existence of the fourth moment are purely
technical conditions that are used for derivation of asymptotic expansions
of the integrals $I_{ij}(\cdot)$. These conditions can be dropped and replaced by
requesting that the conclusions of Lemma 2 remain valid. These conclusions,
however, have to be verified individually for each combination of $\xi(\cdot)$ and
$\eta_j(\cdot)$. In what follows we shall consider the case when $\eta_j$ and $\xi(\cdot)$ are p.d.f.'s
of Student $t$ and Cauchy distributions, respectively. Later, however, we shall
see that it is somewhat beneficial to apply a distribution with faster descent
at $\pm\infty$ than the Cauchy.

3.2. Optimality in Sobolev spaces and Besov spaces with $p \ge 2$. It is well
known that the Sobolev space $H^r$ can be characterized in terms of wavelet
coefficients as $\sum_{j,k} \theta_{jk}^2 (1 + 2^{2jr}) < \infty$ (see, e.g., [11], Section 9.2).

Therefore, $f$ belongs to a ball $H^r(A)$ in $H^r$ space with $r > 1/2$:

(3.13) $f \in H^r(A) \iff \sum_{j=L}^{\infty} \sum_{k=0}^{2^j-1} \theta_{jk}^2 (1 + 2^{2jr}) \le A, \quad r > 1/2.$

j=L fc=0 
(3.14) $f \in B^r_{pq}(A) \implies \sum_{k=0}^{2^j-1} \theta_{jk}^2 \le \begin{cases} B_1 2^{-2jr}, & p \ge 2, \\ B_1 2^{-2j(r + 1/2 - 1/p)}, & 1 \le p < 2, \end{cases}$




for some $B_1 > 0$. The cases $p \ge 2$ and $1 \le p < 2$ apply to spatially homogeneous
and nonhomogeneous functions, respectively, and, as the reader will
see later, the performance of Bayesian models varies greatly in those two
cases. We shall start with the spatially homogeneous case $p \ge 2$ and study
optimality of Bayesian estimators in this case. Since $H^r = B^r_{22}$, it is sufficient
to study the general case of $f \in B^r_{pq}(A)$ with $p \ge 2$; the results for a
Sobolev ball will immediately follow.

In this subsection, we assume that $r > 1/2$ and $p \ge 2$. We also assume
that $\nu_j$ is of the form (3.12) and $J_0$ is such that

(3.15) $2^{J_0} \ge n^{1/(2r)}.$

Condition (3.15) is quite realistic and agrees with the assumption $J - J_0 \to
\infty$ as $n \to \infty$ above, provided $r > 1/2$. In practice, the normality assumption
can be checked via level-by-level testing.

To make the variety of assumptions and assertions of this section easier to
follow, we refer the reader to Table 1 in Section 4.2.
In effect, we consider three kinds of models: (1) models with strictly optimal
convergence rates (which include normal $\xi$-normal $\eta_j$ or heavy-tailed
$\xi$-heavy-tailed $\eta_j$); (2) models optimal up to a log-factor (which include
heavy-tailed $\xi$-normal $\eta_j$); (3) suboptimal models (which include normal
$\xi$-heavy-tailed $\eta_j$). Within the first set of models we also study what happens
when $\xi$ has a faster descent at $\pm\infty$ than $\eta_j$, that is, condition (A3) is
violated.

3.2.1. Models with strictly optimal convergence rates: normal $\xi$-normal $\eta_j$
or heavy-tailed $\xi$-heavy-tailed $\eta_j$. For those models convergence rates are
given by the following theorem.

Theorem 1. Let conditions (A1) and (A2) be valid with $\lambda_\xi = \lambda_\eta = 0$,
or, alternatively, conditions (3.8)-(3.10) hold for any value of $d$. Let also
conditions (A3) and (A4) hold. If for some positive $\varepsilon$

(3.16) $\beta_{jn} = O((\sqrt{n}/\nu_j)^a)$ with $a \le (2r + 1)^{-1}(2 + \delta) - 1 - \varepsilon$ as $j \le j_0$,

then

(3.17) $R_n(B^r_{pq}(A), \hat f) = O(n^{-2r/(2r+1)}), \quad n \to \infty, \; p \ge 2.$

Corollary 1. If $\eta_j(\cdot)$ and $\xi(\cdot)$ are the p.d.f.'s of the double-exponential
or the Student $t$ distributions, or both $\xi(\cdot)$ and $\eta_j(\cdot)$ are normal p.d.f.'s,
satisfying conditions (A2) and (A3) [and (A4) for the $t$ distribution], then
for $\beta_{jn}$ given by (3.16) the risk $R_n(B^r_{pq}(A), \hat f)$ is of the form (3.17).

Note that in both Theorem 1 and Corollary 1 we assume that $\xi(x)$ and
$\eta_j(x)$ have finite fourth moments. This is, however, a sufficient, not a
necessary condition. Namely, the following statement is valid.

Corollary 1*. If $\eta_j(x)$ are the p.d.f.'s of Student $t_{2m_j+1}$ distributions
with $m_j$ integers, $M_1 \le m_j \le M_2$, and $\xi(x)$ is the p.d.f. of the Cauchy
distribution, then $R_n(B^r_{pq}(A), \hat f)$ is of the form (3.17) provided $\beta_{jn}$ is given by
(3.16).

Now suppose condition (A3) is violated and $\xi(x)$ has faster descent to zero
than $\eta_j(x)$ as $|x| \to \infty$. Then, in order for convergence rates to still hold,
we need to impose extra conditions on $\beta_{jn}$, namely, $\beta_{jn}$ should be small for
$j \le j_0$ (i.e., a priori more coefficients should be kept at low resolution levels).
Consider an alternative assumption to (A3):

(3.18) (A3*) $|\eta_j(x)/\xi(x)| \le U(|x|),$

where $U(x_1) \ge U(x_2)$ for any $x_1 > x_2 \ge 1$.

Theorem 2. Let the conditions of Theorem 1 hold with (A3) replaced
by (A3*). Let also for some $\varepsilon > 0$

1 (2+5)/(2r+l)-l-e 

J 

l/(2r+l)+5-e 2 -i/ 2 



P jn = O ( min 

(3.19) 



U{2B 1 2^/ 2 ) 



3 < Jo, 



where $B_1$ is the constant appearing in (3.14). Then $R_n(B^r_{pq}(A), \hat f)$ is of the
form (3.17).

Note that the constant $B_1$ is usually unknown. If $U(x)$ is a homogeneous
function of some order [which happens, e.g., when $\eta_j(x)$ and $\xi(x)$ are the
p.d.f.'s of $t$ distributions], then the value of $B_1$ has no bearing on $\beta_{jn}$.
Otherwise, one can replace $B_1$ by any function of $n$ which grows infinitely as
$n \to \infty$, for example, $\ln n$.

It is easy to see that condition (3.19) is more restrictive than (3.16).
Condition (3.19) means that we a priori intend to keep many more coefficients
than we "kill."

3.2.2. Models optimal up to a log-factor: heavy-tailed $\xi$-normal $\eta_j$.
Unfortunately, for some combinations of distributions assumption (3.10) holds
only when $\sqrt{n}\,|\sqrt{n}\,d|^{\lambda_\eta}/\nu_j \to 0$. This happens if $\eta_j(x)$ are normal distribution
p.d.f.'s and $\xi(x)$ is a heavy-tailed p.d.f., for example, the double-exponential
or Student $t$. In this situation we can ensure somewhat weaker conditions
on $|I_1(d)/I_0(d)|$.

Lemma 1. If $\xi(x)$ and $\eta_j(x)$ are even unimodal p.d.f.'s such that
$|I_1(d)| < \infty$ for any $j$ and $d$, then $|I_1(d)/I_0(d)| \le |d|$.

Note that Theorem 1 and Corollaries 1 and 1* are valid for any distributions
of $Z_i$'s in (1.1). However, if (3.10) does not hold for all values of $d$
and the errors in (1.1) are not normally distributed, one needs additional
assumptions to achieve a nearly optimal convergence rate. These can be two
different kinds of assumptions: either $\mu(x)$ in (2.5) should decrease reasonably
fast at $\pm\infty$ or we should "kill" nearly all coefficients at the highest
resolution levels. In particular, we define $\varrho_0(n)$ to be the unique positive
solution of the equation

(3.20) $\int_{\varrho_0(n)}^{\infty} x^2 \mu(x)\,dx = n^{-2r/(2r+1)},$

where $\mu(x)$ and $\sigma$ are as in (2.5), and denote

(3.21) $\varrho(n) = \max\left[\varrho_0(n), \; \sigma\sqrt{2}\,\sqrt{2r(2r + 1)^{-1} \ln n + 0.5 \ln \ln n}\right].$
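
To make the two quantities concrete, the sketch below solves a tail equation of the form "tail second moment equals $n^{-2r/(2r+1)}$" by bisection and then takes the maximum with the Gaussian-type term $\sigma\sqrt{2}\sqrt{2r(2r+1)^{-1}\ln n + 0.5\ln\ln n}$. It assumes, purely for illustration, that $\mu$ is a double-exponential density with scale $b$ (so $\sigma^2 = 2b^2$), for which the tail second moment is available in closed form; the function names are hypothetical.

```python
import math

def laplace_tail_second_moment(rho, b):
    """Integral of x^2 mu(x) over [rho, inf) for mu(x) = exp(-|x|/b) / (2b), rho >= 0."""
    return 0.5 * math.exp(-rho / b) * (rho * rho + 2.0 * b * rho + 2.0 * b * b)

def rho0(n, r, b):
    """Positive root of tail(rho) = n^(-2r/(2r+1)), found by bisection."""
    target = n ** (-2.0 * r / (2.0 * r + 1.0))
    if target >= laplace_tail_second_moment(0.0, b):
        raise ValueError("n is too small for a positive solution")
    lo, hi = 0.0, b
    while laplace_tail_second_moment(hi, b) > target:
        hi *= 2.0  # the tail is decreasing in rho, so this brackets the root
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if laplace_tail_second_moment(mid, b) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rho(n, r, b):
    """Maximum of rho0(n) and the Gaussian-type term; requires n > e."""
    sigma = b * math.sqrt(2.0)  # standard deviation of the Laplace law
    gauss_term = sigma * math.sqrt(2.0) * math.sqrt(
        2.0 * r / (2.0 * r + 1.0) * math.log(n) + 0.5 * math.log(math.log(n)))
    return max(rho0(n, r, b), gauss_term)
```

For $n = 1000$, $r = 1$ and $b = 1$ the root is roughly 8.4 while the Gaussian-type term is roughly 4.7, so the heavier exponential tail of $\mu$ determines the maximum in this example.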

Theorem 3. Let conditions (A1)-(A4) hold with $\lambda_\xi = 0$ and $\lambda_\eta > 0$ and
let $\beta_{jn}$ satisfy (3.16). If $\mu(\cdot)$ is such that

(3.22) $\lim_{n \to \infty} n^{-1/(4r)} (\varrho(n))^{\lambda_\eta} = 0$

or $\beta_{jn}$ increases quickly as $j > J_0$:

(3.23) = 0( Vj (2^n r /^ +1 ^/^), j > J , 
where $C_\mu = \int_0^{\infty} x \mu(x)\,dx$, then

(3.24) $R_n(B^r_{pq}(A), \hat f) = O(n^{-2r/(2r+1)} (\ln n)^{1/(2r+1)}), \quad n \to \infty.$

Corollary 2. If $\eta_j(\cdot)$ are normal p.d.f.'s with variances bounded from
above and from below by common positive numbers and $\xi(x)$ is the p.d.f. of
the double-exponential or $t$ distribution, then (3.24) is valid provided (3.16)
and either (3.22) or (3.23) holds.

Remark 4. Note that if the errors are normally distributed, then $\lambda_\eta = 1$
and $\varrho(n) = O(\sqrt{\ln n})$, so that condition (3.22) is unnecessary. If the errors
follow the mixture model (2.5), condition (3.22) is valid whenever $\mu(x)$
decreases at an exponential rate or at a rate $|x|^{-a}$ with $a > 3 + 8r^2 \lambda_\eta/(2r + 1)$
as $|x| \to \infty$. The latter requires relatively fast descent (e.g., $a > 17/3$ for
$r = 1$ and $\lambda_\eta = 1$) which may not be true. To overcome this difficulty, one
can choose large values for $\beta_{jn}$ as $j > J_0$, as suggested by (3.23). This measure
will suppress coefficients at higher resolution levels and relax their dependence
on a particular form of $\mu$.
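
The tail threshold quoted in Remark 4 can be checked with exact rational arithmetic. The sketch below assumes a polynomial tail $\mu(x) \propto |x|^{-a}$, for which the tail second moment beyond $\varrho$ behaves like $\varrho^{3-a}$, so that $\varrho_0(n)$ grows like $n^{2r/((2r+1)(a-3))}$; the function names are hypothetical.

```python
from fractions import Fraction

def tail_exponent_threshold(r, lam_eta):
    """Smallest admissible tail exponent: a must exceed 3 + 8 r^2 lam_eta / (2r + 1)."""
    r, lam = Fraction(r), Fraction(lam_eta)
    return 3 + 8 * r * r * lam / (2 * r + 1)

def growth_exponent(a, r, lam_eta):
    """Power of n in n^(-1/(4r)) * rho0(n)^lam_eta when rho0(n) ~ n^(2r/((2r+1)(a-3))).

    A negative value means the product tends to zero as n grows.
    """
    a, r, lam = Fraction(a), Fraction(r), Fraction(lam_eta)
    return -1 / (4 * r) + lam * 2 * r / ((2 * r + 1) * (a - 3))
```

With `r = 1` and `lam_eta = 1` the threshold is exactly 17/3, and `growth_exponent` changes sign precisely at that value, which reproduces the example in the remark.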

3.2.3. Suboptimal models: normal $\xi$-heavy-tailed $\eta_j$. By now, we have
addressed all possible models except for the ones where the prior $\xi(x)$ is
a normal p.d.f. and the error distributions $\eta_j(x)$ are heavy-tailed. In this
situation, assumption (3.8) may be invalid: instead of (3.8), by Lemma 1,
we have

(3.25) $|I_1(d)/I_0(d) - d| = O(|d|) \quad$ if $\nu_j |\nu_j d|^{\lambda_\xi}/\sqrt{n} \ge M.$

This happens when, for example, the $\eta_j(x)$ are p.d.f.'s of the double-exponential
or the Student $t$ distribution. In this situation, convergence rates (3.17) or
(3.24) do not hold.

Theorem 4. Let conditions (A1) and (A2) be valid with $\lambda_\xi > 0$ and
$\lambda_\eta = 0$. Let conditions (A3*) and (A4) hold and $\beta_{jn}$ satisfy (3.19). Then

(3.26) $R_n(B^r_{pq}(A), \hat f) = O(n^{-2r/(2r+1+\lambda_\xi)}), \quad n \to \infty.$

Theorem 4 gives an upper bound for the risk. However, in order to conclude
that certain combinations of distributions are asymptotically inferior
to the others, we are interested in the lower bound for the risk.

Corollary 3. If $\eta_j(x) = \eta(x)$ are identical p.d.f.'s of the double-exponential
or the Student $t$ distribution and $\xi(x)$ is a normal p.d.f., then
for some positive $C$

(3.27) $R_n(H^r(A), \hat f) \ge C n^{-2r/(2r+2)}, \quad n \to \infty.$

Observe that the lower bound in (3.27) is identical to the upper bound
in (3.26) (since $\lambda_\xi = 1$ for the normal p.d.f.) and both are asymptotically
larger than the optimal convergence rate (3.17). This is due to the fact that
the bias of the estimator $\hat f$ converges at a slower rate. The latter happens
because the model with the above combination of $\xi(\cdot)$ and $\eta(\cdot)$ fails to adapt
to sparsity, namely, to the situation when at lower resolution levels we have
very few relatively large coefficients. This is not surprising since in Corollary
3 we have a combination of a rather flat error model and a sharp prior p.d.f.
which may fail to capture the actual value of the estimated coefficient.
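
The gap between the three regimes of this section is easiest to see numerically. The sketch below evaluates the optimal rate $n^{-2r/(2r+1)}$, the same rate inflated by $(\ln n)^{1/(2r+1)}$, and the slower polynomial rate $n^{-2r/(2r+1+\lambda_\xi)}$ with $\lambda_\xi = 1$ for a normal prior; constants are ignored and the function name `risk_rates` is an assumption for this example.

```python
import math

def risk_rates(n, r, lam_xi=1.0):
    """Orders of the risk (constants dropped) for the three kinds of models."""
    optimal = n ** (-2.0 * r / (2.0 * r + 1.0))
    log_factor = optimal * math.log(n) ** (1.0 / (2.0 * r + 1.0))
    suboptimal = n ** (-2.0 * r / (2.0 * r + 1.0 + lam_xi))
    return {"optimal": optimal, "log_factor": log_factor, "suboptimal": suboptimal}
```

For `r = 1` and `n = 10**6` this gives about `1e-4`, `2.4e-4` and `1e-3`, respectively: the logarithmic factor costs little, whereas the normal prior paired with a heavy-tailed error model loses a full power of $n$.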
3.3. Optimality in Besov spaces with $1 \le p < 2$. It follows from the previous
section that the models which satisfy Corollary 1 attain the optimal
convergence rate in Sobolev spaces and spatially homogeneous Besov
spaces $B^r_{pq}$, $p \ge 2$, with minimal assumptions whether the errors are normally
distributed or not. Now, the question is whether all those models also
achieve the optimal convergence rate in spatially nonhomogeneous spaces
with $1 \le p < 2$. Before giving an answer, we need to introduce new values
of parameters for this case. Denote

$j_1 = [(r + 1/2 - 1/p)(2r + 1)]^{-1} r \log_2 n,$
$J_0 = 0.5[\log_2 n + j_1],$

and let $j_0$ and $\nu_j$ be defined by (3.11) and (3.12), respectively, with

(3.28) $m = \begin{cases} m_1 = r + 1/2 - 0.5(1/p - 1/2), & j \le j_0, \\ m_2 = (r + 1/2) - (1/p - 1/2)(1 + 1/r), & j_0 < j \le j_1, \\ m_3 = r + 1/2, & j > j_1. \end{cases}$

Observe that $m = r + 1/2$ for all resolution levels whenever $p = 2$. Note also
that the resolution level $j_1$ is chosen so that

(3.29) $\sum_{j=j_1}^{\infty} \sum_{k=0}^{2^j-1} \theta_{jk}^2 = O(n^{-2r/(2r+1)}).$

The restriction $r > r_p$ ensures that $J - j_1 \to \infty$, so that one can choose
resolution level $J_0$, $j_1 < J_0 < J - 1$, such that $J - J_0 \to \infty$. Assume that

(3.30) (A5) $\int_{-\infty}^{\infty} z^2 \exp(-z^2/(2\sigma^2)) [\eta_j(1 + |z|)]^{-2}\,dz \le C,$

where $\sigma$ is the true standard deviation of the error and $C$ is a constant
independent of $j$. Again we have the same three types of models as we had
in the case of $p \ge 2$. However, neither of the first two types of models delivers
strict optimality when $1 \le p < 2$. The third type of model is suboptimal even
when $p \ge 2$, so we do not consider this type of model here.
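
The resolution levels and exponents introduced in this subsection can be computed directly. The sketch below assumes the level definitions $j_0 = \log_2 n/(2r+1)$, $j_1 = r\log_2 n/[(r + 1/2 - 1/p)(2r+1)]$ and $J_0 = 0.5[\log_2 n + j_1]$, together with the three-piece exponent $m$ used in $\nu_j = C_1 2^{mj}$; the function names are hypothetical, and the levels are returned as real numbers without rounding.

```python
import math

def resolution_levels(n, r, p):
    """Return (j0, j1, J0, J) for sample size n, smoothness r and 1 <= p <= 2."""
    J = math.log2(n)
    j0 = J / (2.0 * r + 1.0)
    j1 = r * J / ((r + 0.5 - 1.0 / p) * (2.0 * r + 1.0))
    J0 = 0.5 * (J + j1)
    return j0, j1, J0, J

def exponent_m(j, j0, j1, r, p):
    """Exponent m in nu_j = C_1 * 2^(m j) for the three ranges of j."""
    a = 1.0 / p - 0.5
    if j <= j0:
        return r + 0.5 - 0.5 * a
    if j <= j1:
        return (r + 0.5) - a * (1.0 + 1.0 / r)
    return r + 0.5
```

For `n = 4096`, `r = 1.5` and `p = 1` the levels are `j0 = 3`, `j1 = 4.5`, `J0 = 8.25` and `J = 12`, so the ordering $j_0 < j_1 < J_0 < J$ holds; at `p = 2` the levels $j_0$ and $j_1$ coincide and the exponent equals $r + 1/2$ everywhere.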

3.3.1. Normal $\xi$-normal $\eta_j$ or heavy-tailed $\xi$-heavy-tailed $\eta_j$ models. Here
we consider the subset of models studied in Section 3.2.1, namely, the models
where the tails of $\xi$ are at least as heavy as the tails of $\eta_j$.

Theorem 5. Let $f \in B^r_{pq}(A)$ with $1 \le p < 2$ and $r > \max(r_p, 1/p)$. Let
$\xi$ and $\eta_j$ satisfy conditions (A1) and (A2) with $\lambda_\eta = \lambda_\xi = 0$ or let both of
them be normal p.d.f.'s. Assume that (A3) and (A4) hold and

(3.31) $\eta_j(x) \le C (x^2 + 1)^{-\nu/2} \exp(-\lambda |x|^{\gamma}).$

If $\beta_{jn} = (\sqrt{n}/\nu_j)^a$ for $j \le j_1$, where

' qi G (-oo, min(i/ - 3, (2r + 2 - - 1)), 



(3.32) a=< 



a 2 G 



if 3 < jo, 

0,,-2 + r + 1 ~ 2/P 



2r(r + l) + 1 -p~ 1 (2r + 2)_ 

ifjo<j<3i, 

whenever $\gamma = 0$ in (3.31), and $a \ge 0$ whenever $j_0 < j \le j_1$ and $\gamma > 0$ in
(3.31), then

(3.33) $R_n(B^r_{pq}(A), \hat f) = O(n^{-2r/(2r+1) + \varepsilon_1} (\ln n)^{\varepsilon_2}), \quad n \to \infty.$

Here

(3.34) 



ei = 0, e 3 = 8r/[7(4r+p(2r + l))], */7>0, 
e 1 = [a*(l/p-l/2)]/[i/(2r + l)]>0, e 2 = -p, if 1 = 0, 



with $a^* = \max(1 + a_1, (1 + a_2)(2r + 2)/r)$. If $a_2 = 0$, then $a^* = (2r + 2)/r$,
the minimum possible value, and $\varepsilon_1 = (2r + 2)/[r\nu(2r + 1)]$ for $\gamma = 0$.

Theorem 5 shows that the models where the $\eta_j$'s have exponential descent
achieve optimal convergence rates up to a logarithmic factor, while $\eta_j$
with polynomial descent lead to suboptimal convergence rates. Observe that
condition (A5) does not exclude the normal model; it just requires that the
variances of $\eta_j$ be larger than the actual variance of our data. In particular,
if $\eta_j(x) = (\sqrt{2\pi}\,\sigma_j)^{-1} \exp(-x^2/(2\sigma_j^2))$, condition (A5) implies that $\sigma_j^2 > 2\sigma^2$.
This does not contradict Abramovich, Amato and Angelini [1], who considered
only the case of $\sigma_j = \sigma$.
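
The variance requirement for a normal error model can be seen by truncating an integral of the form $\int z^2 \exp(-z^2/(2\sigma^2)) [\eta_j(1 + |z|)]^{-2}\,dz$ at increasing limits. The sketch below assumes $\eta_j$ is the $N(0, \sigma_j^2)$ density, so that the exponent of the integrand is $-z^2/(2\sigma^2) + (1 + |z|)^2/\sigma_j^2$ plus a constant, and uses a plain trapezoid rule; the function name and the truncation points are illustrative.

```python
import math

def a5_truncated_integral(sigma, sigma_j, T, steps=20000):
    """Trapezoid value over [-T, T] of z^2 exp(-z^2/(2 sigma^2)) / eta_j(1+|z|)^2
    with eta_j the N(0, sigma_j^2) density."""
    h = 2.0 * T / steps
    log_norm = 2.0 * math.log(sigma_j * math.sqrt(2.0 * math.pi))
    total = 0.0
    for m in range(steps + 1):
        z = -T + m * h
        exponent = -z * z / (2.0 * sigma ** 2) + (1.0 + abs(z)) ** 2 / sigma_j ** 2 + log_norm
        weight = 0.5 if m in (0, steps) else 1.0
        total += weight * z * z * math.exp(exponent)
    return total * h

def a5_holds_for_normal(sigma, sigma_j):
    """The integral is finite exactly when 1/sigma_j^2 < 1/(2 sigma^2)."""
    return sigma_j ** 2 > 2.0 * sigma ** 2
```

With `sigma = 1` and `sigma_j = 2` the truncated values at `T = 10` and `T = 20` agree to many digits, whereas with `sigma_j = 1` the value explodes as `T` grows, in agreement with the requirement $\sigma_j^2 > 2\sigma^2$.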

3.3.2. Heavy-tailed $\xi$-normal $\eta_j$ models. Now we study the models previously discussed in Section 3.2.2. Under the assumption that the $\eta_j$ are identical and the actual errors are normally distributed, these models were investigated in depth by Johnstone and Silverman [19], who demonstrated strict optimality of the models in an empirical Bayes setup in a wide variety of Besov spaces. However, since Johnstone and Silverman [19] considered empirical Bayes estimators, we cannot just reproduce their results here.

Theorem 6. Let $f \in B^r_{p,q}(A)$ with $1 \le p < 2$ and $r > \max(r_p, 1/p)$ and let conditions (A1)-(A4) be valid. Let $\eta_j(\cdot)$ be normal p.d.f.'s and $\lambda_\xi = 0$. Assume also that either condition (3.22) or (3.23) is valid. If $\beta_{jn} = (\sqrt{n}/\nu_j)^{a}$ where $a \ge 0$ whenever $j_0 < j \le j_1$, then

(3.35) $R_n(B^r_{p,q}(A), \hat f) = O\bigl(n^{-2r/(2r+1)} (\ln n)^{\varepsilon_3}\bigr)$,

where $\varepsilon_3 = \max(1/[2r+1], 4r/[4r + p(2r+1)])$.

3.4. Optimality of the models with mixture distribution for errors. In Sections 3.2 and 3.3 we considered only error distributions $\eta_j$ which satisfy certain assumptions uniformly. There is, however, one more interesting case, namely, that of $\eta_j(\cdot)$ being a mixture distribution mimicking (2.5):

(3.36) $\eta_j(x) = (1 - \lambda^*_j)(\sqrt{2\pi}\sigma_0)^{-1} \exp(-x^2/(2\sigma_0^2)) + \lambda^*_j \zeta(x)$,

where $\lambda^*_j = 0$ for $j \le J_0$. Let $\zeta$ be a heavy-tailed distribution (i.e., $\lambda_\zeta = 0$).

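Theorem 7. Let $\lambda_\zeta = 0$ and $\zeta(x) \ge C_{\zeta,0} \exp(-x^2/(2\sigma_0^2))$ for any $x$. Let $\beta_{jn}$ satisfy the assumptions of Theorem 1 ($p \ge 2$) or Theorems 5 and 6 ($1 \le p < 2$). If $\xi(x)$ is a p.d.f. of the normal distribution, then as $n \to \infty$

(3.37)
$R_n(B^r_{p,q}(A), \hat f) = O\bigl(n^{-2r/(2r+1)}\bigr)$, if $p \ge 2$,
$R_n(B^r_{p,q}(A), \hat f) = O\bigl(n^{-2r/(2r+1)} (\ln n)^{4r/(4r+p(2r+1))}\bigr)$, if $1 \le p < 2$,

the second assertion being valid provided $\sigma_0^2 > 2\sigma^2$, where $\sigma^2$ is the true error variance given by (2.5). If $\xi(x)$ is a heavy-tailed p.d.f. ($\lambda_\xi = 0$) and $\beta_{jn}$ and $\lambda^*_j$ are such that

(3.38) $\beta_{jn} \ge \beta_0$ and $\lambda^*_j \le C^*_\lambda$, $j > J_0$,

then as $n \to \infty$

(3.39) $R_n(B^r_{p,q}(A), \hat f) = O\bigl(n^{-2r/(2r+1)} (\ln n)^{\varepsilon}\bigr)$,

where $\varepsilon = 1/(2r+1)$ if $p \ge 2$, and $\varepsilon = \varepsilon_3$ (given by Theorem 6) if $1 \le p < 2$.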

The model (3.36) behaves exactly as the model with normal errors for $j \le J_0$. The advantage of adding a heavy-tailed term for $j > J_0$ is that even in the case when $\xi$ is a heavy-tailed prior, no additional assumptions on the distribution $\mu(x)$ or $\beta_{jn}$ are necessary.
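To make the shape of the mixture error model (3.36) concrete, here is a minimal sketch of the density. The Cauchy choice for the heavy-tailed component $\zeta$ and all function names are assumptions made for illustration only; the paper requires only that $\zeta$ be heavy-tailed.

```python
import math


def normal_pdf(x, sigma0):
    """Normal density with zero mean and standard deviation sigma0."""
    return math.exp(-x * x / (2.0 * sigma0 ** 2)) / (math.sqrt(2.0 * math.pi) * sigma0)


def cauchy_pdf(x):
    """Standard Cauchy density; a hypothetical stand-in for the heavy-tailed zeta."""
    return 1.0 / (math.pi * (1.0 + x * x))


def eta_mixture(x, lam, sigma0, zeta=cauchy_pdf):
    """Mixture density (1 - lam) * normal + lam * zeta, in the form of (3.36)."""
    return (1.0 - lam) * normal_pdf(x, sigma0) + lam * zeta(x)


# With lam = 0 (levels j <= J_0) the model is purely normal.
print(eta_mixture(0.3, 0.0, 1.0), normal_pdf(0.3, 1.0))
# With lam > 0 the tail is dominated by the heavy-tailed component.
print(eta_mixture(10.0, 0.1, 1.0), normal_pdf(10.0, 1.0))
```

The second line of output shows how even a small mixture weight makes the tail of the model density many orders of magnitude heavier than the normal tail.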

3.5. Does $f \in B^r_{p,q}$ a priori? More about prior odds $\beta_{jn}$. So far, the choice of the error model $\eta_j(\cdot)$ was in the limelight. The main assertion about $\xi(\cdot)$ was that it should not have faster descent at $\pm\infty$ than $\eta_j(\cdot)$. However, it is $\xi(\cdot)$ that determines whether the regression function $f(x)$ belongs to a Besov space $B^r_{p,q}$ a priori. Namely, the following statement is valid.

Theorem 8. If $\xi(\theta)$, $\nu_j$ and $\beta_{jn}$ are such that

(3.40) $\int_{-\infty}^{\infty} |\theta|^{\max(p,q)} \xi(\theta)\, d\theta < \infty$,

(3.41) $\lim_{n\to\infty} \sum_{j=L}^{\infty} [2^{j(r+1/2)} \beta_{jn}^{-1/p} \nu_j^{-1}]^{\min(p,q)} < \infty$,

then $f \in B^r_{p,q}$ a priori with probability 1.

It is easy to see that condition (3.40) requires $\xi(\cdot)$ to have at least $\max(p,q) \ge 1$ finite moments, which immediately eliminates the Cauchy prior from the list. On the other hand, any prior with exponential descent ensures validity of (3.40).
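As a small illustration of the moment condition (3.40), the sketch below checks it for Student $t$ priors, using the standard fact that a $t$ distribution with `df` degrees of freedom has a finite absolute moment of order $m$ exactly when $m < $ `df`. The function names are ours and purely illustrative.

```python
import math


def student_t_has_moment(df, m):
    """E|X|^m is finite for a Student t with df degrees of freedom iff m < df."""
    return m < df


def satisfies_moment_condition(df, p, q):
    """Condition (3.40) for a Student t prior: a finite moment of order max(p, q)."""
    return student_t_has_moment(df, max(p, q))


print(satisfies_moment_condition(1, 2, 2))         # Cauchy prior (df = 1): False
print(satisfies_moment_condition(5, 2, 2))         # t_5 prior: True
print(satisfies_moment_condition(5, 2, math.inf))  # q = infinity: False
```

The last case reflects that no $t$ prior can satisfy (3.40) when $q = \infty$, whereas priors with exponential descent have all moments finite.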

Corollary 4. Let $p \ge 2$, $q \ge 1$ and $\beta_{jn} = (\sqrt{n}/\nu_j)^{a}$ with $a \ge 0$ for $j \le j_0$. Let also

(3.42) $\lim_{n\to\infty} \sum_{j > j_0} \beta_{jn}^{-\min(q/p,1)} < \infty$.

If the values of $\nu_j$ are given by (3.12) and $\xi$ satisfies assumption (3.40), then $f \in B^r_{p,q}$ a priori with probability 1.

Note that in the case when condition (A3) holds (Theorems 1 and 3), Corollary 4 can be applied. In order for one to be able to choose $a \ge 0$ in Theorem 1, condition (A4) should hold with $\delta \ge 2r - 1$. Condition (3.42) is needed since Theorem 1 puts absolutely no restrictions on the values of $\beta_{jn}$ for $j > j_0$. The inequality (3.42) holds when, for example, $\beta_{jn} = (\nu_j/\sqrt{n})^{a_1}$ with $a_1 > 0$. The restriction (3.23) does not affect (3.42) since it imposes not small but large values of $\beta_{jn}$ for $j > J_0$. However, the case when condition (A3) is violated is troublesome since it calls for smaller values of $\beta_{jn}$ [see (3.19) in Theorem 2].
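The claim that (3.42) holds for $\beta_{jn} = (\nu_j/\sqrt{n})^{a_1}$ can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this section: it takes $\nu_j = 2^{j(r+1/2)}$ as a stand-in for (3.12), sets $j_0 = \lfloor \log_2 n/(2r+1) \rfloor$, and truncates the infinite sum at $J = \log_2 n$.

```python
import math


def tail_sum(n, r, p, q, a1):
    """Sum over j0 < j <= J of beta_jn ** (-min(q/p, 1)).

    Assumptions (not from the paper's text): nu_j = 2 ** (j * (r + 1/2)),
    j0 = floor(log2(n) / (2r + 1)), J = floor(log2(n)).
    """
    j0 = int(math.log2(n) / (2 * r + 1))
    J = int(math.log2(n))
    expo = min(q / p, 1.0)
    total = 0.0
    for j in range(j0 + 1, J + 1):
        nu_j = 2.0 ** (j * (r + 0.5))
        beta_jn = (nu_j / math.sqrt(n)) ** a1
        total += beta_jn ** (-expo)
    return total


# The sums stay below the geometric-series bound 1 / (1 - 2 ** -(r + 1/2))
# for a1 = 1 and q >= p, uniformly in n.
for n in (2 ** 10, 2 ** 14, 2 ** 20):
    print(n, tail_sum(n, r=2.0, p=2.0, q=2.0, a1=1.0))
```

Under these assumptions each term is at most a fixed geometric factor times the previous one, so the sums are uniformly bounded in $n$.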

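Corollary 5. Let $1 \le p < 2$ and $\nu_j$ and $\beta_{jn}$ be determined by Theorem 5 or 6. If (3.40) holds, $\beta_{jn} = (\sqrt{n}/\nu_j)^{a}$ for $j \le j_1$ where

(3.43)
$a = a_1 \in (2p, \infty)$, if $j \le j_0$,
$a = a_2 \in (2p(r+1), \infty)$, if $j_0 < j \le j_1$,

and

(3.44) $\lim_{n\to\infty} \sum_{j > j_1} \beta_{jn}^{-\min(q/p,1)} < \infty$,

then $f \in B^r_{p,q}$ a priori with probability 1.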

Note that Corollary 5 is applicable under the conditions of Theorems 5-7 whenever assumption (3.31) on $\eta_j(\cdot)$ holds with $\gamma > 0$ or $\nu$ large enough, so that conditions (3.32) and (3.43) on $a$ can be satisfied simultaneously. Again, assumption (3.44) is necessary since Theorems 5-7 put almost no restrictions on $\beta_{jn}$ when $j > j_1$.

Corollaries 4 and 5 imply that in order to ensure $f \in B^r_{p,q}$ a priori almost surely, one needs fairly large values $\beta_{jn}$, for example, such that the sums $\sum_j \beta_{jn}^{-\min(q/p,1)}$ are uniformly bounded [see (3.42)]. This fact motivates the choice $\pi_{jn} = 2^{-bj}$ in [1, 3], which is equivalent to $\beta_{jn} \sim 2^{bj}$. In the present paper, however, we choose $\beta_{jn} = (\sqrt{n}/\nu_j)^{a}$, so that $\beta_{jn} \sim 1$ at the "optimal" resolution level $j_0$. Is this a reasonable choice? The answer is "yes" if one wants to obtain the optimal convergence rate for a wide range of models.

Theorem 9. Let conditions (A1)-(A4) be valid. Assume that $p \ge 2$, $\nu_j$ are defined by (3.12) and the densities $\eta_j(x) = \eta(x)$ are identical. If $\pi_{jn} = O(2^{-bj})$ with $b > 0$, then for some positive $C$

$R_n(B^r_{p,q}(A), \hat f) \ge C u(n) n^{-2r/(2r+1)}$

with $u(n) = [\eta^{-1}(n^{-b/(2r+1)})]^{2r/(r+1/2)} \to \infty$.

In particular, $u(n) \sim (\ln n)^{2r/(2r+1)}$ if $\eta$ is a normal p.d.f., $u(n) \sim (\ln n)^{4r/(2r+1)}$ if $\eta(x)$ is a double-exponential p.d.f. and $u(n) \sim n^{br/[(r+1/2)^2(2+\delta)]}$ if $\eta(x) \sim |x|^{-(2+\delta)}$ as $|x| \to \infty$.
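The growth of the factor $u(n)$ in Theorem 9 can be evaluated directly once $\eta^{-1}$ is available. The sketch below does this for a standard normal and a standard double-exponential density; the unit scale parameters and the function names are assumptions made for illustration.

```python
import math


def eta_inverse_normal(y):
    """Positive solution x of exp(-x^2/2) / sqrt(2*pi) = y (requires small y)."""
    return math.sqrt(-2.0 * math.log(y * math.sqrt(2.0 * math.pi)))


def eta_inverse_double_exp(y):
    """Positive solution x of 0.5 * exp(-x) = y (requires y < 0.5)."""
    return -math.log(2.0 * y)


def u(n, r, b, eta_inverse):
    """u(n) = [eta^{-1}(n^{-b/(2r+1)})]^{2r/(r+1/2)}, as in Theorem 9."""
    y = n ** (-b / (2 * r + 1))
    return eta_inverse(y) ** (2 * r / (r + 0.5))


r, b = 2.0, 1.0  # illustrative values
for n in (1e6, 1e12):
    print(n, u(n, r, b, eta_inverse_normal), u(n, r, b, eta_inverse_double_exp))
```

Both factors grow with $n$, and the double-exponential one grows faster, in line with the powers $2r/(2r+1)$ and $4r/(2r+1)$ of $\ln n$.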



4. Discussion. 



4.1. Discussion of the simulations. The present paper is a case where extensive finite sample simulations were carried out a few years before the theoretical properties of the estimators were studied. Clyde and George [8] discuss wavelet regression with errors being a scale mixture of normal errors and the priors on wavelet coefficients of the form

$\theta_{jk} \mid \lambda^*_{jk}, \gamma_{jk} \sim N(0, \sigma^2 c_j \gamma_{jk}/\lambda^*_{jk})$, $\lambda^*_{jk} \sim h^*$, $\gamma_{jk} \sim \mathrm{Bernoulli}(\pi_j)$.

They consider threshold shrinkage and multiple shrinkage estimators, both based on the posterior mean, with the difference that the first one is calculated conditional on $\gamma_{jk}$ and data, while the second one is conditional on data only. Similarly to Johnstone and Silverman [17, 19], the authors
estimate hyperparameters by maximizing marginal likelihood. They show 
that their estimators are computationally competitive with standard classi- 
cal thresholding methods and also perform well in the case of nonnormally 
distributed errors. It is easy to check that the multiple shrinkage estima- 
tors of Clyde and George [8] coincide with the ones proposed in the present 
paper. Simulation study shows (see Figure 2 of their paper) that multiple 
shrinkage estimators are superior to the threshold shrinkage ones; therefore, 
the authors use multiple shrinkage estimators in their later simulations. 

Clyde and George [8] suggest several choices for the prior and error distributions (normal prior-normal errors, normal prior-Student $t_\nu$ errors, $t_\nu$ prior-$t_\nu$ errors, etc.), but the only cases that made it to the actual simulations are normal prior-normal errors (N), Cauchy prior-$t_5$ errors (C5) and uncorrelated but dependent $t_5$ errors and prior (T5). Note that all three models satisfy Corollary 1 and are asymptotically optimal in $B^r_{p,q}$ with $p \ge 2$ without additional assumptions on the error. Hence, although no studies of the less favorable cases are presented in the paper by Clyde and George [8], these cases have probably been eliminated as the ones producing inferior results.

Clyde and George [8] run simulations for the by now traditional test functions "blocks," "bumps," "doppler" and "heavisine" proposed by Donoho and Johnstone [12] and report results for three types of actual error distributions: normal, $t_5$ errors in the wavelet domain and $t_5$ errors in the data domain (see Figures 3, 5 and 6 of their paper). Before discussing Clyde and George's
[8] findings we want to draw the reader's attention to the fact that their 
enumeration of resolution levels is just the opposite from ours: their highest 
resolution level is the most coarse while ours is the finest one. 

Clyde and George [8] state that empirical Bayes wavelet estimators based 
on the models N, C5 and T5 are in the majority of cases superior to the 
traditional thresholding rules. However, they have no tool to explain the 
discrepancy in the performances of these three models, and discrepancies in 
the performance of various Bayesian models in general. We now are ready 
to make this final step. 

While performing simulations with normal errors the authors discovered that model N gives somewhat better precision than the T5 and C5 models. The results reverse when the error has the Student $t_5$ distribution in the wavelet domain: models T5 and C5 lead to smaller MSE than model N. Clyde
and George [8] remark that "performance of all the estimators tends to be 
worse under heavy-tailed error distributions" and that "the performance of 
N worsens the most." However, this is not surprising in view of the findings 
of Section 3.2. Note that heavy-tailed error distributions lead to increase 
in variance, and model N is the most vulnerable to this increase. Recall 
that at high resolution levels the variances of Bayesian estimators of the 
wavelet coefficients are bounded by a constant times n/uj for the models 
with Ajj = 0, while for the model N they are proportional to n/v'j times the 
variance of the wavelet coefficient. This is the reason why the growth of the 
variance decreases efficiency of model N the most. The same phenomenon 
(although somewhat milder due to the central limit theorem) carries over to the case when the errors have $t_5$ distributions in the data domain.

The above also explains why Bayesian models show the least discrepancy for the "bumps" test function. This function, as Figure 4 of [8] shows, has the smallest value of prior variance, which translates into higher $\nu_j$ in our notation. This leads to a higher proportion of the bias component in the overall error, and, by far, the highest value of the overall error among the four test functions. Hence, the variance component has the lowest weight in the overall error for the "bumps" function, so that all three models, N, C5 and T5, show similar performance in this case.

4.2. Discussion and summary of the models. Table 1 summarizes the comparison of the Bayesian models carried out in the previous section. We assume that $\eta_j(\cdot) = \eta(\cdot)$ are identical and consider three choices for $\xi(\cdot)$ and $\eta(\cdot)$: the normal, the double-exponential and the $t$ distribution.

The choices of $\xi$ and $\eta$ are listed in the first row and in the first column, respectively. The table presents asymptotic expressions for $n^{2r/(2r+1)} R_n(B^r_{p,q}(A), \hat f)$, which we denote by $\Delta_1$ when $p \ge 2$ and $\Delta_2$ when $1 \le p < 2$; hence, $\Delta_1$ and $\Delta_2$ show deviation from the "ideal" rate $O(n^{-2r/(2r+1)})$. We also introduce the common parameter $\varrho = 4r + p(2r+1)$. The case of the mixture distribution for errors is not covered by the table since it mimics the normal $\eta$-normal $\xi$ or normal $\eta$-heavy-tailed $\xi$, depending on whether $\xi$ is normal or heavy-tailed, with the only difference that the restriction (3.22) or (3.23) on the unknown distribution $\mu$ or $\beta_{jn}$, respectively, is unnecessary. To be more specific, we distinguish the situations when we have a mixed model (2.5) or a normal model [$\lambda_j = 0$ in (2.5)] for errors. If both models give the same results with no additional assumptions, we leave the cell of the table unmarked; otherwise, if assumption (3.22) or (3.23) is required, we mark the cell with $\diamond$. The cells where the additional assumption (3.19) on $\beta_{jn}$, $j \le j_0$, is required are marked with $\triangle$.

Table 1 provides a comprehensive comparison of the models. As one can see, for spatially homogeneous Besov spaces ($p \ge 2$), the models of choice are the models which provide the optimal convergence rate, N-N, DE-DE, DE-T and T-T, where N, DE and T stand for normal, double-exponential and $t$ distributions, respectively; the first letter gives the choice of $\eta$ and the second that of $\xi$. These models provide optimal convergence rates with very few restrictions [only (A3) and (A4)], and no additional assumptions are required if the errors in (1.1) are not normally distributed. The models N-DE and N-T provide optimal convergence rates up to a logarithmic factor; however, the main flaw of these models is that they are sensitive to slow descent of the error distribution $\mu$ at $\pm\infty$: if (3.22) is not guaranteed, the additional assumption (3.23) on $\beta_{jn}$ for $j > J_0$ is necessary. To avoid this feature of the N-DE and N-T models one can use models with the mixture error distribution (3.36), where $\zeta(\cdot)$ is a heavy-tailed p.d.f. We shall denote these models by N$\zeta$-N, N$\zeta$-DE and N$\zeta$-T, depending on the choice of $\xi$. Note that models N$\zeta$-DE and N$\zeta$-T behave similarly to N-DE and N-T but do not require assumption (3.22) or (3.23).

The situation changes if one considers spatially nonhomogeneous Besov spaces ($1 \le p < 2$). In these cases, the model T-T becomes suboptimal while N-N, N-DE, N-T, DE-DE and DE-T still provide optimality up to a logarithmic factor. The model N-N, however, requires the very restrictive assumption $\sigma_0^2 > 2\sigma^2$, which can cripple empirical Bayes inference on parameters of the model. The models N-DE and N-T still require additional assumptions on $\beta_{jn}$ for $j > J_0$ in this case. Hence, the models of choice in this situation are DE-DE, DE-T (although they have a slightly higher power of the log-factor), N$\zeta$-DE, N$\zeta$-T, or N-DE and N-T if one is sure that the error distribution has fairly fast decay.

Table 1
Comparison of various models

Choices of $\xi$ (columns):
  Normal: $(\sqrt{2\pi}\sigma_1)^{-1} \exp(-x^2/(2\sigma_1^2))$
  Double-exponential: $(2\sigma_1)^{-1} \exp(-|x|/\sigma_1)$
  t-distribution: $\Gamma((\nu_1+1)/2)(\nu_1\pi)^{-1/2} / [\Gamma(\nu_1/2)(1 + x^2/\nu_1)^{(\nu_1+1)/2}]$

Choices of $\eta$ (rows): the same three densities with $\sigma_0$ and $\nu_0$ in place of $\sigma_1$ and $\nu_1$.

Normal $\eta$:
  Normal $\xi$: $\Delta_1 = O(1)$ if $\sigma_0 < \sigma_1$ or (3.19); $\Delta_2 = O((\ln n)^{4r/\varrho})$ if $\sqrt{2}\sigma < \sigma_0 < \sigma_1$
  Double-exponential $\xi$ ($\diamond$): $\Delta_1 = O((\ln n)^{1/(2r+1)})$; $\Delta_2 = O((\ln n)^{\kappa})$, $\kappa = \max(4r/\varrho, 1/(2r+1))$
  t-distribution $\xi$ ($\diamond$): $\Delta_1 = O((\ln n)^{1/(2r+1)})$; $\Delta_2 = O((\ln n)^{\kappa})$, $\kappa = \max(4r/\varrho, 1/(2r+1))$

Double-exponential $\eta$:
  Normal $\xi$ ($\triangle$): $\Delta_1 = O(n^{2r/((2r+1)(2r+2))})$
  Double-exponential $\xi$: $\Delta_1 = O(1)$ if $\sigma_0 < \sigma_1$ or (3.19); $\Delta_2 = O((\ln n)^{8r/\varrho})$ if $\sigma_0 < \sigma_1$
  t-distribution $\xi$: $\Delta_1 = O(1)$; $\Delta_2 = O((\ln n)^{8r/\varrho})$

t-distribution $\eta$:
  Normal $\xi$ ($\triangle$): $\Delta_1 = O(n^{2r/((2r+1)(2r+2))})$
  Double-exponential $\xi$: $\Delta_1 = O(1)$
  t-distribution $\xi$: $\Delta_1 = O(1)$ if $\nu_0 > \nu_1$ or (3.19); $\Delta_2 = O(n^{\varepsilon_0} (\ln n)^{-p})$ if $\nu_0 > \nu_1$, with $\varepsilon_0 = (2r+2)/[r(2r+1)(1+\nu_0)]$

$\Delta_i = n^{2r/(2r+1)} R_n(B^r_{p,q}(A), \hat f)$ with $i = 1$ if $p \ge 2$ and $i = 2$ if $1 \le p < 2$.
$\varrho = 4r + p(2r+1)$, $\sigma^2$ is the true error variance.
$\diamond$ requires assumption (3.22) or (3.23) for non-Gaussian errors in (1.1).
$\triangle$ requires assumption (3.19) on $\beta_{jn}$, $j \le j_0$.
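For readers who want numbers rather than symbols, the following sketch evaluates the exponents quoted for Table 1 at given $r$ and $p$. It is an illustration of the arithmetic only: the dictionary keys follow the paper's model labels (first letter for $\eta$, second for $\xi$), while the function names are ours.

```python
def varrho(r, p):
    """Common parameter 4r + p(2r+1)."""
    return 4 * r + p * (2 * r + 1)


def log_power(model, r, p):
    """Power of ln n in Delta_2 (case 1 <= p < 2) for the log-optimal models."""
    rho = varrho(r, p)
    kappa = max(4 * r / rho, 1 / (2 * r + 1))
    powers = {
        "N-N": 4 * r / rho,
        "N-DE": kappa,
        "N-T": kappa,
        "DE-DE": 8 * r / rho,
        "DE-T": 8 * r / rho,
    }
    return powers[model]


def polynomial_loss_normal_prior(r):
    """Power of n in Delta_1 for the DE-N and T-N models."""
    return 2 * r / ((2 * r + 1) * (2 * r + 2))


r, p = 2.0, 1.0  # illustrative values
for model in ("N-N", "N-DE", "N-T", "DE-DE", "DE-T"):
    print(model, log_power(model, r, p))
print("DE-N / T-N polynomial loss:", polynomial_loss_normal_prior(r))
```

For $r = 2$ and $p = 1$ the double-exponential error models pay twice the logarithmic power of N-N, whereas DE-N and T-N lose a genuine power of $n$.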

The T-DE model provides optimal convergence rates in spatially homogeneous Besov spaces but only under condition (3.19) which, as one can see from Corollary 4, prevents $f \in B^r_{p,q}$ a priori. Hence, this model is inferior to DE-DE, DE-T and T-T and, for this reason, has not been studied in the case of $p < 2$. Note that N-N, DE-DE and T-T require condition (3.19) if $\sigma_0 \ge \sigma_1$ or $\nu_0 \le \nu_1$.

The least advantageous choices in terms of asymptotic convergence rates are DE-N and T-N. Not only do these models fail to provide optimal convergence rates but they also require assumption (3.19), which prevents $f \in B^r_{p,q}$ a priori.

Section 3.5 allows us to supplement our comparison with some restrictions on $\xi(\cdot)$. In order that $f \in B^r_{p,q}$, condition (3.40) requires the p.d.f. $\xi$ to have at least $\max(p,q)$ finite moments. Hence, condition (3.40) prevents $f \in B^r_{p,\infty}$ whenever $\xi$ is a p.d.f. of the $t$ distribution. In general, Theorem 8 calls for prior distributions with faster descent at $\pm\infty$. However, it should be noted that the considerations of Theorem 8 have no bearing on the asymptotic convergence rates of the estimator.

Finally, some recommendations for a choice of a model can be drawn. If the errors in (1.1) are normally distributed, one can use the DE-DE, DE-T, N-DE and N-T models, which ensure optimality (up to a logarithmic factor) in both spatially homogeneous and nonhomogeneous Besov spaces. However, if the normality assumption is violated and errors may have a heavy-tailed distribution, DE-DE, DE-T, N$\zeta$-DE and N$\zeta$-T will be safer choices. In addition, if one wants to be true to the Bayesian spirit and make sure that $f \in B^r_{p,q}$ a priori for any possible $q$, then the models where $\xi$ decreases exponentially (i.e., DE-DE and N$\zeta$-DE) may be preferable.

5. Proofs. 

Proof of Lemma 1. Since $I_0(d)$ is an even and $I_1(d)$ is an odd function of $d$, consider the case $d > 0$. Note that

(5.1)

since the integral in (5.1) is nonnegative. Thus, $I_1(d) \le d I_0(d)$. □

The proof of Theorem 1 is based on the following lemmas.

Lemma 2. If $\nu_j|d|$ is bounded or $\nu_j|\nu_j d|^{\lambda_\xi}/\sqrt{n} \to 0$, then as $\nu_j/\sqrt{n} \to 0$
(5.2) / (d) = ^e(^-rf)[l + 0(™~\ 2 M| 2A «)], 



/i(d) i/j 

— — d — T]2 — 



(5.3) 



M<0 



1 + 



Vj\vjd\ 



2\c 



71 



71 



If i/n\d\ is bounded or y/n\y/nd\ Xr > jvj — > 0, then as \fnjvj — > 



ii 



l + 0( ^\Vn~d\ 2X ^ 



(5.4) I (d) ~ ^(V»w0[l + O(ni/J' i |- N /ndr ")] 
^ jjj^ * y/nrj'(y/nd) 

//ere ?? 2 = /f^ x 2 rj(x) dx, £ 2 = /f^ x 2 £(x) ofe. 



Proof. We shall give the proofs for (5.4) and (5.5); the proofs of (5.2) and (5.3) are conducted in a similar manner. Change variables $y = \nu_j x$ in (2.9) and use Taylor series expansion:



h{d)= f°° y U^d-^y\(y)dy 



n(^ynd) 



n 



71 



j 



6^ 3 



yV"(v^d) + 



t,(y)dy- 



Letting $i = 0$ and $i = 1$ in (5.6) and taking into account that $\xi(\cdot)$ is an even function, we obtain (5.4) and (5.5). □

Lemma 3. Let $g(n)$ be defined by (3.20) and (3.21) and let $\sigma$ be as in (2.5). Then

(5.6) $E(d_{jk} - \theta_{jk})^{2i} = O(n^{-i})$, $i = 1, 2$;

(5.7) $P(\sqrt{n}|d_{jk} - \theta_{jk}| > a\sqrt{\ln n}) = o(n^{-a^2/(2\sigma^2)})$, $j \le J_0$;

(5.8) $E[(d_{jk} - \theta_{jk})^2 I(\sqrt{n}|d_{jk} - \theta_{jk}| > g(n))] = O(n^{-4r/(2r+1)})$.

Proof. Validity of statements (5.6) and (5.7) follows directly from the fact that [cf. (2.5)]

(5.9) $d_{jk} - \theta_{jk} \sim (1 - \lambda_j)\sqrt{n}(\sqrt{2\pi}\sigma)^{-1} \exp\{-nx^2/(2\sigma^2)\} + \lambda_j\sqrt{n}\,\mu(\sqrt{n}x)$. □
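The Gaussian tail estimate behind a bound of the type $P(\sqrt{n}|d_{jk} - \theta_{jk}| > a\sqrt{\ln n}) = o(n^{-a^2/(2\sigma^2)})$ can be checked numerically in the purely normal case. The sketch below assumes $\sqrt{n}(d_{jk} - \theta_{jk}) \sim N(0, \sigma^2)$ (no heavy-tailed component) and uses the standard inequality $2(1 - \Phi(t)) \le \exp(-t^2/2)$ for $t \ge 0$; the function names are illustrative.

```python
import math


def two_sided_tail(a, sigma, n):
    """P(|Z| > a * sqrt(ln n)) for Z ~ N(0, sigma^2), computed via erfc."""
    t = a * math.sqrt(math.log(n)) / sigma
    return math.erfc(t / math.sqrt(2.0))


def polynomial_bound(a, sigma, n):
    """The comparison rate n ** (-a^2 / (2 sigma^2))."""
    return n ** (-a * a / (2.0 * sigma * sigma))


a, sigma = 2.0, 1.0  # illustrative values
for n in (10.0, 1e3, 1e6):
    tail = two_sided_tail(a, sigma, n)
    bound = polynomial_bound(a, sigma, n)
    print(n, tail, bound, tail / bound)
```

The ratio in the last column decreases with $n$, which is what the little-$o$ statement expresses.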

Lemma 4. Let <f> and ip be boundary coiflets introduced in [18, 19] pos- 
sessing s > r vanishing moments and based on orthonormal coiflets supported 
in [-S + 1, S], s < S. Assume that p>l, r > max(l/2, 1/p, r p ), where r p is 
defined in (3.1) and L > log 2 (6S' — 6). If f £ B pq (A), then for some absolute 
positive constants A\ and Ai 

( 5 - 10 ) E E^i* - 3k f < iin-W), 

j=L k=0 

(5.11) ^9% < A 2 2~ 2 ^-^- 1 / 2 M. 

k=0 



Proof. The proof is based on Proposition 5 of [18], which under con- 
ditions of Lemma 4 can be written as 

(5.12) E \9 jk - d jk \P < AC{r lP ^^)2~^ + ^- l l 2 ^. 

k=0 

To prove (5.10), consider cases p > 2 and 1 < p < 2 separately. If p > 2, the 
Holder inequality leads to EkJoH^jk-Ojk) 2 = 0(2- 2rJ+2 ^P~ 1 / 2 ^2^ 1 - 2 M) = 
0(2~ 2rJ ). Adding the terms together we obtain 

ESVi* - 9jk) 2 = of E 2 " 2rJ ) = 0(n _1 lnn) = 0(n~ 2 ^ 2 ^). 

j=L k=0 \j=L J 

if i < P < 2, then El=o(djk - e jk ) 2 < (Eto 1 !^ - %fcl p )) 2/p = 

( 2 -2rj+2(i/ P -i/2)j)_ Summing the terms, we arrive at E/=| Efclfo 1 - 

9 jk ) 2 = 0(2- 2J ( r+1 /2-l/p)) = ( n -2r/(2r+l)) provide d r > rp . 

To derive (5.11), note that $\theta \in B^r_{p,\infty}(A)$, where with some abuse of notation we use $B^r_{p,\infty}$ to denote the Besov space of infinite sequences. Hence $\sum_{k=0}^{2^j-1} \theta_{jk}^2 = O(2^{-2j(r-(1/p-1/2)_+)})$, which in combination with (5.10) yields (5.11). □

Proof of Theorem 1. Since the wavelet basis is orthonormal,

2 L -1 



R n (B r M (A)J) = £ E(9 k -& 2 



(*> -13) J-12-j-l oo 2J-1 

j=L k=0 j=J k=0 

Observe that the first term in (5.13) is bounded by 2 V^^, = Q 1 [Var(^/ c ) + (9 k — 
9~k) 2 ] = 0(n _1 ), while the last term is bounded by A2~ 2rJ = 0(n~ 2r ) due to 
(5.11). By Lemma 4, the second term in (5.13) is dominated by 

J-12J-1 J-1V-1 

(5.14) E E $ik - hk? < 2 E E S fe " 6 ok? + 4jrT 2r , 

j=i k=Q j=L k=0 

that is, the main contribution to R n (B r pq (A), f) is made by the first term 
in (5.14). Therefore, we need to construct an asymptotic upper bound for 

R = YfjZl Eto 1 E 0jk ~ d Jk) 2 = Ri + R2 with 

jo 2^-1 J -I 23-1 

(5.15) = E E ^fe - M 2 . « 2 = E E E &k - °jk) 2 . 

j=L k=0 j=jo+l k=0 

Let us examine each of the terms in turn. Denote

(5.16) $A_{jn}(d) = \beta_{jn}\sqrt{n}\,\eta_j(\sqrt{n}d)/I_0(d)$

and note that $R_1 \le 2(R_{11} + R_{12})$ where

jo V-\ , j / , s v 2 

(5 17) 3=Lk=Q {jh) 

r - V V 1 F { h(dj k )/I (d jk ) h(d jk ) \ 2 

12 V l + A jn (d jk ) Io(d jk )) ' 



j=L k =o x * 1 h(djk). 

To establish an asymptotic upper bound for R\\ observe that by combination 
of Lemma 2 and (3.8), for j < jo 

E(h(d j k)/h(d j k)-9 jk f 
1 ^ < 2[E(I 1 (d j k)/h(d jk ) - d jk f + E(d jk - 9 jk f] 

[b 8j = 0(E(ujd 2 jk /n 2 ) + u 2 /n 2 + a 2 /n) 

= 0( V p 2 k /n 2 + v 2 /r? + a 2 /n), 

so that by (3.14)

(5.19) $R_{11} = O\Bigl(\sum_{j=L}^{j_0} [2^{-2jr}\nu_j^4/n^2 + 2^j\nu_j^2/n^2 + 2^j/n]\Bigr) = O(n^{-2r/(2r+1)})$.

In the case of $R_{12}$, note that $R_{12} = R_{121} + R_{122}$ where

(5 20) i " 1 - 4 J»(<W 

j=L k=0 \lo{ajk) 1 + A jn {d jk ) 
For iii2i , recall that since Vj < s/n and by assumptions (A3) and (A4) 



(5-21) W^4<Q 



(y/nd jk ) 2+d r] j (y/nd jk ) 



(, 2 (vjdjk) {^jdjk) 2+s i]j(^jd jk ) 



n 2+8 — n 2+5 



By (3.16), (5.16) and since 5 + (2r + l)" 1 > (2r + 1) _1 ( 2 + 8) - 1, the latter 
implies that 

/ io 2^-1 \ 

/r 20) \j=L k=0 

A „ /V?\ V(2r+l)+6 



^-2r/(2r- +1 )^^2 n (^) j = 0(n -2r/(2r+l) } 



For the sake of construction of an upper bound for -Ri22> note that since 
Vj\dj k \ < C s we have h{d jk ) ~ Vj£(vjd jk ) > Vj^{C dd ). Thus, by x/(l + x) x 
min(x,l), we derive that Aj n (dj k ) = 0(min(l, 0j n nvj 2 rj 2 (\fndj k )). Let ay 
be a unique positive solution of f) 2 n nv~ 2 r] 2 {aj n ) = 1, that is, 

(5.23) ttjn = 77 ■ 



.1 >i 



Since r/j(x) are decreasing as x > 0, Aj n (dj k ) < 1 iff |djfc| > aj n /y/n. Hence, 
by condition (A4), E\d) k min(l, A 2 n (d jk ))\ = E\d 2 k I{^i\d jk \ < a jn )) + 
E [d]k(3jn nu i 2 rf(Vnd jk )I(Vn\d jk \ > a jn )], so that 

E[d 2 k min(l, A 2 jn (d jk ))] < a 2 jn /n + f3 2 n nvj 2 r] 2 {a 2 n )a 2 n /n = 2a 2 jn /n, 

(5.24) 

since xrjj(x) are nonincr easing functions of x for x > 0. Now, note that, 
as > (7a, by condition (A4) |x| 2+5 ? ?j (x) < C^rj^Cs) < Cf + ^(0) < 
C| +5 (7^^(0) = C*. Therefore, 77,- (a?) < C^ ] \x\-^ for any x and rfj l {z) < 
Ctp z-W+Q . Let /3 jn = {^hjvjY. Then, from (5.16) 

(5.25) $a_{jn} = \eta_j^{-1}(\nu_j/(\beta_{jn}\sqrt{n})) = O([\sqrt{n}/\nu_j]^{(1+a)/(2+\delta)})$.

Combining (5.20), (5.24), (5.25) and Lemma 1, we derive

/ jo 23-1 \ 

^122 = E E E[d%min(l,A 2 jn (d jk ))} 

\j=L k=0 ) 
/ jo 23-1 



(5.26) 



°(E E"" 1 !"/^ 1 ""^ 

\j=L k=0 

(to 2i 

V T 

\j=L 



77? 



(l+a)/(2+5)^ 



JO 



n 



2 (2r+l)j 



[l/(2r+l)-(l+a)/(2+5)]^ 



2r/(2r+l) ^^ 
= 0(n -2r/(2r+l) ) J " L 

since $(1 + a)(2r + 1) \le 2 + \delta$. Combination of (5.19), (5.20), (5.22) and (5.26) ensures that $R_1 = O(n^{-2r/(2r+1)})$.

Now, consider $R_2$ given by (5.15). Since $|\hat\theta_{jk}| \le |I_1(d_{jk})/I_0(d_{jk})|$, by combination of Lemma 2 and (3.10)



(5.27) 



/ J-l 2J-1 

^ = o E E 

\j=j +l k=0 
/ J-l 23-1 

= E E 



/i(^- fc ) x 

lo(djk) 

,2^2 



9 2 



n n z Ed jk 



V=jo+i fc=o 

Vj=jo+l 
0(ri -2r/(2r+l) )) 



which completes the proof. □ 



Proof of Corollary 1. The proof follows directly from Theorem 1 by Lemma 2 whenever $\lambda_\xi = \lambda_\eta = 0$. If $\xi(x)$ and $\eta_j(x)$ are normal p.d.f.'s, validity of (3.8)-(3.10) can be verified by direct calculation of $I_0(d)$ and $I_1(d)$. □



Proof of Corollary 1*. Using properties of the Fourier transform and formulae (8.432.5), (8.468), (3.944.5) and (3.944.6) of [15], one can show by direct calculation that $|I_1(d)/I_0(d) - d| = O(|d|\nu_j/\sqrt{n})$, $I_0(d) \sim \nu_j\xi(\nu_j d)$, if $\nu_j/\sqrt{n} \to 0$ and $|I_1(d)/I_0(d)| = O(\sqrt{n}|d|/\nu_j)$ if $\nu_j/\sqrt{n} \to \infty$, so Theorem 1 remains valid. □



Proof of Theorem 2. It is easy to see that condition (A3) is used only for derivation of the $R_{121}$ term in (5.20). Since $I(\nu_j|d_{jk}| > C_\delta) \le I(\sqrt{n}|d_{jk} - \theta_{jk}| > a\sqrt{\ln n}) + I(\nu_j|d_{jk}| > C_\delta) I(\nu_j|d_{jk}| \le \nu_j|\theta_{jk}| + a\nu_j\sqrt{\ln n}/\sqrt{n})$, we have $R_{121} = R_{1211} + R_{1212}$. Here, by Lemma 3,

/ jo 2i-l \ 

i?mi = o E E E[d%i(VE\d jk - e jk \ > aVi^l)] 

\j=L k=0 J 



/ jo V-l 

° E E i E ( d Jk - Ojkf + tf k P(yfr\d jh - e jk \ > 

\j=L k=0 



av Inn 



30 



0(n -2r/(2r+l)) + O £ 2- 2 ' r n-° /2<J 



0(n 



-2r/(2r 



provided $a^2 > 4r\sigma^2/(2r+1)$. For $R_{1212}$, comparing with (5.21) and (5.22), take into account that $\nu_j|\theta_{jk}| \le B_1 2^{j/2}$ and $\nu_j\sqrt{\ln n}/\sqrt{n} = o(2^{j/2})$. Using (3.19), obtain



J o 2*-l r o2 



^=0 EE 

\j=L fc=0 



2 \ 1+5 



Vj\0j k \ + a— F=vln 



n 



n 



x [7 2 Vj\0j k \ +a-p=Vln 



n 



\j=L 



° E^(^) 2i^(2B 1 2^)U0(n- 2r /^ 1 )). 



□ 



Proof of Theorem 3. Note that since condition (3.10) is no longer 
valid, we need to derive new upper bounds for R2. Note that \9j k — 6j k \ < 
\6 jk \ + \9 jk \ and E/^+iEto 1 ^ = 0(n _2r/(2r+1) ), so we shall be con- 
cerned with the \9j k \ term only. Partition J2jJ2k@j k the sum over 
jo < i < ^0 an d Jo + l<j<J — 1 and denote the respective sums by R3 
and i?4. 

To analyze R3 note that < \Ii(dj k )/Io(dj k )\ and that 

^ 5 1 < I(Vn\9 jk \ > avlrm) + I{Vn\d jk - 6 jk \ > aVhvn) 

+ I(y/n\dj k — 9j k \ < aV\nn)I(y/n\9j k \ < bVm). 
Hence i?3 = -R31 + -R32 + -R33, where by Lemmas 1 and 3 

/ Jo 2J-1 \ 
^31 = E E^V(v^M>«v1n^)] 
\j'=io+i fc=o / 



(5.29) 



/ Jo 2J-1 

o E E 

\7=jo+l fe=0 



? 2 fc + £(^- %fc ) 2 /^<^ 



1 



n a 2 Inn 
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= ( n - 2r /( 2r + 1 )), 

/ Jo 2^-1 

R32 = oi ]T E E[d%I(y/E\d jk -e jk \>aVh^)} 
v=io+i fc=o > 

/ Jo 2J-1 



° E Etv^-M 4 

\j=j +l k=0 



(5.30) x V^d^'fe " jk \V^ > aVbm) + d 2 k ]j 

/ Jo 2J-1 \ 

=o[ e E^^ 2r/(2r+1) +^] 

Vj'=io+i fc=o / 
= 0(2 J °n-( 4r+1 )/( 2r+1 )) = (n- 2r /( 2r+1 )) 

provided a 2 > 8a 2 r/(2r + 1). To derive an asymptotic upper bound for -R33, 
note that 

I(\/n\djk — @jk\ < aVlnn)I(y/n\6jk\ < a Vlnn ) 

< /(-v/^l^jfcl < 2a vlnn ) 

^ 531 ) </(v^l^fcl <2aV^)lh^(Vhm) Xri 0^) 

+ /^(v1n^)" A ''=0(l)) 

< /(^^(v^l^fel)^ - 0) + I(2^ +1 ) = 0(n(lnn) A ")). 
Note that Lemma 2 and (y/n\dj k \) Xri — ► imply that 

^[[/i^^/Zo^O] 2 ^"'^^!^^)^ - 0)] 

= 0(n l /- 4 ( v ^|d jfc |) 2A "/( l /ri^(v^l^A ; |) A " - 0)) 
= 0( z ,r 2 ) = 0(2-( 2r+1 ^). 

Therefore, by calculations similar to (5.27), the portion of -R33 corresponding 
to the first term in (5.31) is 0(n- 2r ^ 2r+1 ^). 

By Lemma 1, E[Ii(djk)/Io(djk)] 2 = 0(E[dj k - 9j k ] 2 + 6 2 k ), so the second 
term in the portion is 

o( E 2 ^[n' 1 +9%}I[2^ =0{n l ^ 2r+l \\nn)^^ 2r+l ^ 

\j=30+l k=Q J 
$= O\bigl(n^{-2r/(2r+1)} (\ln n)^{\lambda_\eta/(2r+1)}\bigr)$.

Consequently,

(5.32) $R_{33} = O\bigl(n^{-2r/(2r+1)} (\ln n)^{\lambda_\eta/(2r+1)}\bigr)$,

and (5.29), (5.30) and (5.32) imply that $R_3 = O\bigl(n^{-2r/(2r+1)} (\ln n)^{\lambda_\eta/(2r+1)}\bigr)$.

Derivation of an asymptotic upper bound for is, by and large, similar 
to that for R 3 . First, assume that condition (3.22) holds. Rewrite (5.28) 
and (5.31) with ay/Inn replaced by g(n) and observe that under assump- 
tion (3.22) the second indicator /(2^ 2r+1 ) = 0(n(lnn) A ")) in (5.31) van- 
ishes. Now, same as before, R41 = o(n _2r /( 2r+1 )) and R42 = 0(n~ 2r ^ 2r+1 ^) 
by (5.8). To ensure that .R43 = 0{n~ 2r ^ 2r+l ^) repeat calculations for the -R33 
corresponding to the first term in (5.31). 

If assumption (3.22) is violated, the only term in the proof which is af- 
fected is i?43. Namely, the second indicator in (5.31) does not vanish. Re- 

call that R 43 = O^Zl+iEl'jo 1 E[d%HV^\d jk \ < 2g(n))]). Since \§ jk \ < 
\h{djk)\/[l3jn\/ni-]j{y/nd jk )} and \h{d)\ < r](0)^/niy~ 1 J^° 00 \z\^(z)dz, we de- 
rive that, by (3.23), R i3 = O(E/=} +i ELV E [Pjn^Vj(MnW 2 )- To com- 
plete the proof note that the assumption that fj, has at least four finite mo- 
ments implies that g(n) < ^/C^n r ^ 2r+1 \ Consequently, R^ 3 = 0(n~ 2r ^ 2r+l ">). 
□ 



Proof of Corollary 2. The proof follows directly from Lemma 2. 

□ 



Proof of Theorem 4. The fact that condition (3.25) replaces (3.8) affects the term $R_{11}$ only. Denote $\Delta_{jk} = I_1(d_{jk})/I_0(d_{jk}) - I_1(\theta_{jk})/I_0(\theta_{jk})$ and observe that

(5.33) $R_{11} \le 2 \sum_{j=L}^{j_0} \sum_{k=0}^{2^j-1} [E(I_1(\theta_{jk})/I_0(\theta_{jk}) - \theta_{jk})^2 + E\Delta_{jk}^2] = 2(R_{111} + R_{112})$.

For an upper bound on $R_{111}$ note that (5.11) implies that $|\theta_{jk}| \le \sqrt{A_2}\, 2^{-jr}$, so that $(\nu_j|\theta_{jk}|)^{\lambda_\xi}\nu_j/\sqrt{n} = O(2^{j(r+\lambda_\xi/2+1/2)}/\sqrt{n})$. The last expression does not turn to zero only if $2^j \ge C n^{1/(2r+1+\lambda_\xi)}$ for some $C > 0$. Also, it follows from Lemma 2 that

(5.34) $(\nu_j|\theta_{jk}|)^{\lambda_\xi}\nu_j/\sqrt{n} \to 0 \;\Rightarrow\; |I_1(\theta_{jk})/I_0(\theta_{jk}) - \theta_{jk}| = O(1/\sqrt{n})$.

Hence,

$R_{111} = O\Bigl(\sum_{j=L}^{j_0} 2^j/n\Bigr) + O\Bigl(\sum_{j=L}^{j_0} \sum_{k=0}^{2^j-1} \theta_{jk}^2 I[2^j \ge C n^{1/(2r+1+\lambda_\xi)}]\Bigr) = O(n^{-2r/(2r+\lambda_\xi+1)})$.

To examine $R_{112}$ consider three separate cases. If both $(\nu_j|\theta_{jk}|)^{\lambda_\xi}\nu_j/\sqrt{n}$ and $(\nu_j|d_{jk}|)^{\lambda_\xi}\nu_j/\sqrt{n}$ tend to zero, then $|\Delta_{jk}| \le |I_1(d_{jk})/I_0(d_{jk}) - d_{jk}| + |d_{jk} - \theta_{jk}| + |I_1(\theta_{jk})/I_0(\theta_{jk}) - \theta_{jk}|$ and, by Lemma 2 and (5.34), we have $E\Delta_{jk}^2 = O(1/n)$. Consequently, the respective portion of $R_{112}$ is $O(n^{-2r/(2r+1)})$. If $(\nu_j|\theta_{jk}|)^{\lambda_\xi}\nu_j/\sqrt{n}$ does not tend to zero, then, as was mentioned earlier, $2^j \ge C n^{1/(2r+1+\lambda_\xi)}$ for some $C > 0$. Hence the respective portion of $R_{112}$ is $O\bigl(\sum_{j=L}^{j_0} \sum_{k=0}^{2^j-1} [\theta_{jk}^2 I(2^j \ge C n^{1/(2r+1+\lambda_\xi)}) + E(d_{jk} - \theta_{jk})^2]\bigr) = O(n^{-2r/(2r+\lambda_\xi+1)})$. The third case occurs if $(\nu_j|\theta_{jk}|)^{\lambda_\xi}\nu_j/\sqrt{n} \to 0$ but $(\nu_j|d_{jk}|)^{\lambda_\xi}\nu_j/\sqrt{n}$ does not tend to zero. Then $\Delta_{jk}^2 = O(1/n) + O(|d_{jk} - \theta_{jk}|^2) + O(d_{jk}^2)$ and $\sqrt{n}|d_{jk} - \theta_{jk}| \ge C^*(n/\nu_j^2)^{(\lambda_\xi+1)/(2\lambda_\xi)}$. Considering cases $n/\nu_j^2 > \ln n$ and $n/\nu_j^2 \le \ln n$ separately and applying Lemma 3, we obtain $R_{112} = O(n^{-2r/(2r+\lambda_\xi+1)})$. □

The proof of Corollary 3 is based on the following lemmas.

Lemma 5. Let $\xi(\cdot)$ and $\eta_j(\cdot) = \eta(\cdot)$ be such that $|I_1(d)/I_0(d) - d| \ge |d|/2$ for $d = \sqrt{A/2}\, 2^{-i_0 r}$ with $i_0 \le j_0$. Then $R(n, H^r(A)) \ge (A/8) 2^{-2 i_0 r}$.

Proof. Consider $f(x) = \theta^*_{i_0,0}\psi_{i_0,0}(x)$ with $\theta^*_{i_0,0} = \sqrt{A/2}\, 2^{-i_0 r}$, $i_0 \ge L$. It is easy to check that $\theta_{i_0,0} = \theta^*_{i_0,0}$ and $\theta_{jk} = 0$ otherwise. Thus, the coefficients $\theta_{jk}$ satisfy condition (5.11) with $A_2 = A$ and $p = 2$. For this function $f$, under the condition of Lemma 5, the squared bias term is at least $(\theta^*_{i_0,0})^2/4 = (A/8) 2^{-2 i_0 r}$. □

Lemma 6. If $\xi(x)$ is a p.d.f. of the normal distribution and $\eta(x)$ is a p.d.f. of the double-exponential or the Student $t$ distribution, then Lemma 5 holds with $i_0 = (2r+2)^{-1}\log_2 n$.

Proof. Let Vj/y/n — ► 0, and d = ^A/2 2~ jr . If n(x) is the Student t dis- 
tribution, then direct calculations show that \Ii(d) / Io(d) — d\ ~ \d{C u a 2 / {vjd) 2 — 
1)| where C v is a constant depending on the degrees of freedom of the t dis- 
tribution; hence {I^d) / I (d) -d\>\d\/2 for j > i with 2 io > n 1 /! 2 ^ 2 ). 

In the case where ij(x) is the double-exponential distribution, 1 1\ (d) /Iq (d) — 
d\ ~ \d-a 2 ^uj 2 \ > \d\/2 for j > i with i satisfying 2 io > {%no 4 A- 1 ) 1 / { > 2r+1 \ 
□ 

Proof of COROLLARY 3. Corollary 3 follows directly from Lemmas 
5 and 6. □ 

The proof of Theorem 5 is based on the following lemma. 
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23-1 
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^^^V(^|^ fc |<(lnn) fc ) = 0(r 



3 k=0 
j k=0 



2r/(2r+l)j lnn |46r-/(2r+l)> 



^ ^ n- 1 /(V^|^fc| > (Inn) 6 ) = 0(n" 2r /( 2r+1 ) [lnn]" p6 ) 



Proof. The proof is an adaptation of the proof in Donoho, Johnstone, 
Kerkyacharian and Picard [13]. □ 
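The first bound of Lemma 7 is quoted later in the proof of Theorem 5 in a slightly different but equivalent form. The identity below is elementary algebra and is included only to make that correspondence explicit.

```latex
\[
\bigl[n(\ln n)^{-2b}\bigr]^{-2r/(2r+1)}
  = n^{-2r/(2r+1)}\,(\ln n)^{4br/(2r+1)} .
\]
```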

Proof of Theorem 5. Similarly to (5.15), partition $R$ as $R = R_1 + 
R_5 + R_2$ where $R_1$, $R_5$ and $R_2$ correspond to $j \le j_0$, $j_0 < j \le j_1$ and $j > j_1$, 
respectively, and consider each term separately. Note also that in condition 
(A4), $\delta$ can be made as large as one desires if $\gamma > 0$, and $\delta < \nu - 2$ if $\gamma = 0$. 

Low resolution levels: $j \le j_0$. Note that $\nu_j$ is chosen so that $\nu_j/\sqrt{n} = o(1)$ and 
$\sum_{j=L}^{j_0}\sum_{k=0}^{2^j-1} (n^{-2}\nu_j^2 + n^{-2}\nu_j^4\theta_{jk}^2) = O(n^{-2r/(2r+1)})$, so that the upper 
bound (5.19) for $R_{11}$ is still valid. Partition $R_{12} = R_{121} + R_{122}$ as in (5.20). 
Consider the cases $\gamma > 0$ and $\gamma = 0$ separately. If $\gamma > 0$, then taking into 
account that $\nu_j|d_{jk}| > C_s$, 



(5.35) 



< exp< —A 



for any v > 0. In the case 7 = 



7/2 



1 



c s 



O 



(5.36) 



(l + m^/2 



Hence, following (5.22) and denoting u = v for 7 > and u = v for 7 = 0, 
we obtain 



(5.37) 



/ JO 



•ai)-(r+l/2-l/p)] 



0(n J _ 2 L r/{2r+1)) 



provided $\alpha_1$ satisfies (3.32) if $\gamma = 0$ (we have no restrictions on $\alpha_1$ if $\gamma > 0$). 
The last term, $R_{122}$, we partition into $R_{122} = R_{1221} + R_{1222}$ depending on 
the values of $\sqrt{n}\,|\theta_{jk}|$. Here 

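$$
\begin{aligned}
R_{1221} &= O\Big(\sum_{j=L}^{j_0}\sum_{k=0}^{2^j-1} E\big[d_{jk}^2 \min(1, A_{jn}^2(d_{jk}))\, I(\sqrt{n}\,|\theta_{jk}| \le (\ln n)^b)\big]\Big) \\
&= O\Big(\sum_{j=L}^{j_0}\sum_{k=0}^{2^j-1} [n^{-1} + \theta_{jk}^2]\, I(\sqrt{n}\,|\theta_{jk}| \le (\ln n)^b)\Big) \\
&= O\big([n(\ln n)^{-2b}]^{-2r/(2r+1)}\big)
\end{aligned}
\tag{5.38}
$$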

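by Lemma 7. For $R_{1222}$, repeating (5.24) and (5.25), we derive 

$$R_{1222} = O\Big(\sum_{j=L}^{j_0}\sum_{k=0}^{2^j-1} n^{-1}\big[\eta_j^{-1}\big([\nu_j/\sqrt{n}]^{1+\alpha_1}\big)\big]^2\, I(\sqrt{n}\,|\theta_{jk}| > (\ln n)^b)\Big). \tag{5.39}$$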

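By assumption on the decay of $\eta_j$ we derive that 

$$\eta_j^{-1}\big([\nu_j/\sqrt{n}]^{1+\alpha_1}\big) \le \begin{cases} C(\ln n)^{1/\gamma}, & \text{if } \gamma > 0, \\ C(n/\nu_j^2)^{(1+\alpha_1)/(2\nu)}, & \text{if } \gamma = 0. \end{cases} \tag{5.40}$$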
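Consider the two cases separately. If $\gamma > 0$, then by (5.40) and Lemma 7, 

$$R_{1222} = O\Big(\frac{(\ln n)^{2/\gamma}}{n}\sum_{j=L}^{j_0}\sum_{k=0}^{2^j-1} I(\sqrt{n}\,|\theta_{jk}| > (\ln n)^b)\Big) = O\big(n^{-2r/(2r+1)}[\ln n]^{2/\gamma - pb}\big). \tag{5.41}$$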

Now, to obtain (3.34), choose $b$ minimizing $\max(4br/(2r+1),\, 2/\gamma - pb)$. If 
$\gamma = 0$, then using (3.32) and (5.40), we derive by direct calculation that 



$$R_{1222} = O\Big(\sum_{j=L}^{j_0} n^{-1}\,2^j\,(n/\nu_j^2)^{(1+\alpha_1)/\nu}\Big) = O\big(n^{-2r/(2r+1) + [(1+\alpha_1)/(\nu(2r+1))](1/p - 1/2)}\big), \tag{5.42}$$

which agrees with (3.34). 
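For $\gamma > 0$ the choice of $b$ above trades off two logarithmic exponents: $4br/(2r+1)$, which increases in $b$, and $2/\gamma - pb$, which decreases in $b$. The maximum of the two is smallest where they are equal. A minimal sketch of that computation follows; it assumes $b$ ranges over the real line, the function name is hypothetical, and no claim is made here that the resulting exponent coincides with the one stated in (3.34).

```python
def balanced_b(r, p, gamma):
    """Return (b, exponent) where 4*b*r/(2r+1) equals 2/gamma - p*b.

    The first expression increases in b and the second decreases in b,
    so their maximum is minimized at the crossing point.
    """
    slope = 4.0 * r / (2.0 * r + 1.0)   # growth rate of the first exponent
    b = (2.0 / gamma) / (p + slope)     # solves slope * b = 2/gamma - p * b
    return b, slope * b


if __name__ == "__main__":
    b, exponent = balanced_b(r=1.0, p=2.0, gamma=2.0)
    print(b, exponent)
```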

High resolution levels: $j_1 + 1 \le j \le J - 1$. Repeat (5.27) with $j_0$ replaced 
by $j_1$. Then $R_2 = O(n^{-2r/(2r+1)})$ follows from (3.29) and (5.27) and from 
the fact that $\sum_{j=j_1+1}^{J-1}\sum_{k=0}^{2^j-1}\theta_{jk}^2 = O(n^{-2r/(2r+1)})$. 

Medium resolution levels: $j_0 < j \le j_1$. Partition $R_5 = R_6 + R_7$ where 

$$R_6 = \sum_{j=j_0+1}^{j_1}\sum_{k=0}^{2^j-1} E(\hat\theta_{jk} - \theta_{jk})^2\, I(\sqrt{n}\,|\theta_{jk}| > (\ln n)^b), \tag{5.43}$$

and $R_7$ has $\le$ instead of $>$ inside the indicator function; then further partition 
$R_6$ into $R_{61}$ and $R_{62}$ in a manner similar to (5.17). Repeat (5.18) and 
note that $m_2$ is chosen so that 



(5-44) £ eVK 2 + 0> ; 

Then, by Lemma 7, 



0(n 



-2r/(2r+l) 



)• 



31 2J-l r ^2 



(5 . 45) ««-ol E E 

v ; V/=.?o+l fe=0 

= 0(n -ar/(2r + l)- 



+ 



n 2 



+ 



a 2 I(V^\6 jk \ > (In 



Mi 



ii 



Partitioning $R_{62}$ further into $R_{621}$ and $R_{622}$ in a manner similar to $R_{121}$ and 
$R_{122}$ in (5.20), we derive 



^621 = E E^ 

V=io+i fe=o 



Ii(d ifc ) ^„(dj fc )J(i/,-|djfc| < C s ) 



(5.46) 



1 + A,n(djfc) 

x/(V^d>(hm) b ) 



n 



= 0(n- 2r K 2r+ V) + 0[ E /3 J 2 n ,^ +2<5 n-( 1+5 )2- 2 ^'- +1 / 2 - 1 ^ 
Vj=io+i " / 

= 0(n -2r/(2r+l) ) 

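by Lemmas 1 and 7, (5.21), condition (3.32) on $\alpha_2$ and the choice of $m_2$. 
For $R_{622}$ we obtain an upper bound similar to (5.39): 

$$R_{622} = O\Big(\sum_{j=j_0+1}^{j_1}\sum_{k=0}^{2^j-1} n^{-1}\big[\eta_j^{-1}\big([\nu_j/\sqrt{n}]^{1+\alpha_2}\big)\big]^2\, I(\sqrt{n}\,|\theta_{jk}| > (\ln n)^b)\Big). \tag{5.47}$$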

In the case where $\eta_j(x)$ has exponential decay ($\gamma > 0$) we just repeat (5.41) 
to obtain (3.34). In the case $\gamma = 0$ denote $(1 + \alpha_2)/\nu = h$ and observe that 



R 6 22 = 0[ E J2 n ~ 1 (n/v-) h I(^\0 jk \>(lnn) b ) 



(5.48) 



v=io+i fc=o 

/ h p /2-i+h v - 2h \\Q- IIP 

I . 4^ , (lnn)P fo 2 2 J'( r + 1 /2-i/p) i 
^ n -2r/(2r+l)+h(l/p-l/2)(r+l)/(r(r+l/2))> 
which agrees with (3.34). Now, to complete the proof we need to consider 
the term $R_7$ given by (5.43). Since 

1^ k 9 k \< \Il^ihlIl9^2hl ~ 1 \djk-Qjk\ 



(5.49) 



1 + Aj n (d jk ) 

M +\e jk \ 



1 + A jn (d jk ) 



O 



1 + A jn (d jk ) 

Uj v)\d jk \n^ \d jk -9 jk \ 



+ 



+ 



n 1 + A jn (dj k ) l + A jn (dj k ) 

o(\e jk \ + *+ \ d f- e f ), 



+ \0 jk \ 



we partition $R_7$ into $R_{71}$, $R_{72}$ and $R_{73}$. Here, by Lemma 7 and (5.44) 



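$$R_{71} = O\Big(\sum_{j=j_0+1}^{j_1}\sum_{k=0}^{2^j-1}\theta_{jk}^2\, I(\sqrt{n}\,|\theta_{jk}| \le (\ln n)^b)\Big) = O\big([n(\ln n)^{-2b}]^{-2r/(2r+1)}\big), \tag{5.50}$$

$$R_{72} = O\Big(\sum_{j=j_0+1}^{j_1}\sum_{k=0}^{2^j-1} n^{-2}\nu_j^2\Big) = O\big(n^{-2r/(2r+1)}\big). \tag{5.51}$$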

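The third term, $R_{73}$, we partition into $R_{731}$ and $R_{732}$ where 

$$
\begin{aligned}
R_{731} &= \sum_{j=j_0+1}^{j_1}\sum_{k=0}^{2^j-1} E\left[\frac{(d_{jk} - \theta_{jk})^2}{(1 + A_{jn}(d_{jk}))^2}\right] I\big(1 < \sqrt{n}\,|\theta_{jk}| \le (\ln n)^b\big), \\
R_{732} &= \sum_{j=j_0+1}^{j_1}\sum_{k=0}^{2^j-1} E\left[\frac{(d_{jk} - \theta_{jk})^2}{(1 + A_{jn}(d_{jk}))^2}\right] I\big(\sqrt{n}\,|\theta_{jk}| \le 1\big).
\end{aligned}
\tag{5.52}
$$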

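By Lemma 7, 

$$R_{731} = O\Big(\sum_{j=j_0+1}^{j_1}\sum_{k=0}^{2^j-1} n^{-1}\, I(n^{-1} < \theta_{jk}^2)\, I(\sqrt{n}\,|\theta_{jk}| \le (\ln n)^b)\Big) = O\big(n^{-2r/(2r+1)}[\ln n]^{4br/(2r+1)}\big). \tag{5.53}$$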



For an upper bound for $R_{732}$ note that since $\xi(x)$ is bounded 
$E[(d_{jk} - \theta_{jk})^2 (1 + A_{jn}(d_{jk}))^{-2}]\, I(\sqrt{n}\,|\theta_{jk}| \le 1)$ 

(x - 9 jk f 



o 



(1 + fi jn <Jnv i 1 'q j (yfnx)) 2 
$= O\big(\nu_j^2 \beta_{jn}^{-2} n^{-2} \int \exp(-z^2/2\sigma^2)\,[\eta_j(1 + |z|)]^{-2}\,dz\big)$ 
$= O(\nu_j^2 \beta_{jn}^{-2} n^{-2})$ 

by condition (A5). Hence, whenever a 2 > 0, by (5.44) we have 

(5.54) R 732 = ( jh E-K 2 )=0(n- 2 ^ +1 )). 

Vj=jo+i fc=0 / 

Combination of (5.37), (5.38), (5.41), (5.42) and (5.45)-(5.54) completes the 
proof. □ 

Proof of Theorem 6. Note that for j < ji, the derivation of the 
error is identical to that for Theorem 5. When j > ji, the proof is very sim- 
ilar to the proof of Theorem 3 with jo replaced by ji and A^ = 1. Again, 
since $|\hat\theta_{jk} - \theta_{jk}| \le |\hat\theta_{jk}| + |\theta_{jk}|$ and due to (3.29), we shall be concerned with 
the $|\hat\theta_{jk}|$ term only. Noting that $|\hat\theta_{jk}| \le |I_1(d_{jk})/I_0(d_{jk})|$ and using the 
inequality (5.28), we partition the error and derive upper bounds identical 
to those in (5.29), (5.30) and (5.32), which leads to $\sum_{j=j_1+1}^{J_0}\sum_{k=0}^{2^j-1} E(\hat\theta_{jk} - 
\theta_{jk})^2 = O(n^{-2r/(2r+1)}(\ln n)^{1/(2r+1)})$. Derivation of the upper bound for 
$\sum_{j=J_0+1}^{J-1}\sum_{k=0}^{2^j-1} E(\hat\theta_{jk} - \theta_{jk})^2$ repeats the derivation of $R_4$ in Theorem 3, 
and similarly, we obtain $\sum_{j=J_0+1}^{J-1}\sum_{k=0}^{2^j-1} E(\hat\theta_{jk} - \theta_{jk})^2 = O(n^{-2r/(2r+1)})$. □ 



Proof of Theorem 7. The validity of Theorem 7 follows from Theorems 
1, 3, 5 and 6. The only portion of the error which needs additional 
consideration is the one corresponding to $j > J_0$. Recall that 

, , * \)h{d jk ) + {l-\*)I{{d jk ) 

1 ' ) ^ \*h{d jk ) + (1 " \*)W jk ) + PjnV^VjiV^djk) ' 

where rjj(x) and Xj are given by (3.36), 

/oo . 
x 1 \/n(V2ir(Jo)~ exp(—n(d — x) 2 / (2a 2 .))i'j£(i'jx) dx, 
-00 

a = 0,1, 

/oo 
x ^ ^/n((^/n(d — x))i/j£(i/jx) dx, i = 0,l. 

-oo 
Consider the cases when $\xi$ is a normal and a heavy-tailed prior separately. 
If $\xi$ is a normal prior, then by Lemma 2 



(5.56) 



R s= E e'^-M 2 

i=J +l fc=0 

/ J-l 23-1 r / T (1 

= o[ J2 J2 eW — 

\j=J +l k=0 

= o( E eVv^4+^7 4 

\j=J +l fc=0 



Uo(rfi 



+ 



^o(4?'fc) 



+ 



0(n- 2r /^ +1 )). 



+< 



If = 0, then partition Rg as = -Rsi + -^82 where 

so that $R_{81} = O(n^{-2r/(2r+1)})$ by Lemma 2 and calculation similar to (5.56), 
and 



7* . 



3k 



R 



82 



o 



/ J-l 23-1 

E E^ 

Vj=J +l fc=0 



h(djk) 



xl(^M>M 



Let Ci( x ) be a heavy-tailed p.d.f. such that 

(5.57) C f) i (trover 1 exp{-x 2 /2ag} < Ci(s) < C c , 2 (x 2 + l) _1 C(x) 

for some positive C^i and C^. Then, it is easy to see that /o(^) < / v^Ci x 
(y/n(d — x))vj£(vjx) dx ~ y/nQi{yJnd) . Thus, by Lemma 2 y/nQ{y/ndj k ) / 
Io(djk) > C^lC(\/nd jk )/Ci(\/nd jk ) and 



R 



(5.58) 



/ ^ ^ [Ii(d jk )/I (d jk )] 2 I(u^ n \d jk \ > MY 

Ua+i^o ^(^-'CC^fcVCiC^)] 2 , 
°( E E ^[(n^fc) -1 ^ 1 ^ 1 "!^*! > 

\j=J +l fc=0 / 
0(rJ -2r/(2r. 



which completes the proof. □ 

Proof of Theorem 8. The proof is based on the following obvious 
corollary of Kolmogorov's three series theorem: if $Z_1, Z_2, \ldots, Z_n, \ldots$ are 
independent variables such that $\sum_n E|Z_n| < \infty$, then the series $\sum_n Z_n$ converges 
with probability 1 (see, e.g., [22], Section 4.2). Since we impose priors on 
each coefficient independently, it follows from the above that $f \in B^r_{p,q}$ if 



$$S_1 = \sum_{j} 2^{j(r+1/2-1/p)q}\, E\Big[\sum_{k=0}^{2^j-1} |\theta_{jk}|^p\Big]^{q/p} < \infty \tag{5.59}$$



where the expectation is taken over the prior p.d.f. $\pi_{jn}\nu_j\xi(\nu_j x) + (1 - 
\pi_{jn})\delta(0)$. Note that for $a > 0$ we have $E|\theta_{jk}|^a = C_a\pi_{jn}\nu_j^{-a}$ where $C_a = 
\int |x|^a\xi(x)\,dx$ provided the last integral is convergent. If $q/p < 1$, then using 
Jensen's inequality $E(|Y|^{q/p}) \le (E|Y|)^{q/p}$ and the expression for $E|\theta_{jk}|^a$ 
we obtain 



$$S_1 \le \sum_{j} 2^{j(r+1/2-1/p)q}\Big[\sum_{k=0}^{2^j-1} E|\theta_{jk}|^p\Big]^{q/p}. \tag{5.60}$$

Taking into account that (3j n = (1 + 7rj n ) _1 , we derive (3.41). 

If $q > p$, note that if $f \in B^r_{p,p}$, then $f \in B^r_{p,q}$, so it is enough to consider 
$q = p$. Repeating the previous calculation, we arrive at (3.41) again. □ 
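The moment formula $E|\theta_{jk}|^a = C_a\pi_{jn}\nu_j^{-a}$ used in this proof can be checked numerically for a concrete slab density. The sketch below assumes, purely for illustration, that $\xi$ is the standard normal density, for which $C_a = 2^{a/2}\Gamma((a+1)/2)/\sqrt{\pi}$; the function names and the parameter values are hypothetical. The point mass at zero contributes nothing to the moment when $a > 0$, so only the slab part is integrated.

```python
import math


def normal_abs_moment(a):
    """C_a = E|Z|^a for a standard normal Z (closed form)."""
    return 2 ** (a / 2) * math.gamma((a + 1) / 2) / math.sqrt(math.pi)


def prior_abs_moment(a, pi_jn, nu_j):
    """Closed form C_a * pi_jn * nu_j^(-a) for the spike-and-slab prior."""
    return normal_abs_moment(a) * pi_jn * nu_j ** (-a)


def prior_abs_moment_numeric(a, pi_jn, nu_j, half_width=10.0, steps=20000):
    """Midpoint-rule integral of |x|^a * pi_jn * nu_j * xi(nu_j * x)."""
    lo = -half_width / nu_j
    h = 2.0 * half_width / (nu_j * steps)
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        slab = nu_j * math.exp(-0.5 * (nu_j * x) ** 2) / math.sqrt(2.0 * math.pi)
        total += abs(x) ** a * pi_jn * slab * h
    return total


if __name__ == "__main__":
    for a in (1.5, 2.0):
        print(a, prior_abs_moment(a, 0.3, 4.0), prior_abs_moment_numeric(a, 0.3, 4.0))
```

The two columns printed for each $a$ should agree to several digits, which illustrates the scaling in $\nu_j^{-a}$ that drives the summability condition (3.41).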



Proof of Corollary 4. Plug the values of $\nu_j$ and $\beta_{jn}$ into (3.41) and 
check that the sum (3.41) is uniformly bounded by a constant independent 
of $n$. □ 



Proof of Corollary 5. Plugging the values of i/j given by (3.28) 
into (3.41), we derive that (3.41) holds whenever the sum 5* = Y^L L ^ V x 

(3j n Tam ^ q ^ P ' lS) ^ s uniformly bounded, where 

r(l/p-l/2)/2, L<j<j , 
(5.61) v=l (l/p-l/2)(l + l/r), jo<j<h, 

U, J > ii - 

Now to complete the proof, check that S* is uniformly bounded whenever 
(3.43) is valid. □ 



The proof of Theorem 9 is based on the following lemma. 



Lemma 8. Under the conditions of Theorem 9, for some positive abso- 
lute constant C 

„ 62) >- ct^'"{^ > >(^)) 

-o( I i~ 2r/(2r+1) ). 
Proof. Repeating the beginning of the proof of Theorem 1, we derive 
that $R_n(B^r_{p,q}(A), \hat f) \ge R \ge R_1$ where $R_1$ is defined in (5.15). Since for any 
random variables $X$ and $Y$, $E(X + Y)^2 \ge 0.5EX^2 - EY^2$, we obtain that 
$R_1 \ge 0.5R_{12} - R_{11}$ where, just as in (5.19), $R_{11} = O(n^{-2r/(2r+1)})$. Now, by 
(5.20) $R_{12} \ge R_{121}$. Consequently, 

$$R_n(B^r_{p,q}(A), \hat f) \ge 0.5R_{121} - O(n^{-2r/(2r+1)}) \tag{5.63}$$

where $R_{121}$ is defined in (5.20). Recall (5.23) and by direct calculation verify 
that $\alpha_{jn} \ge C_s\sqrt{n}\,\nu_j^{-1}$ if and only if 

$$\beta_{jn} \ge \nu_j\big[\sqrt{n}\,\eta(C_s\sqrt{n}\,\nu_j^{-1})\big]^{-1}. \tag{5.64}$$



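For the values of $j$ for which (5.64) holds, $A_{jn}(d_{jk}) \ge 1$, so that $[A_{jn}(d_{jk})/(1 + 
A_{jn}(d_{jk}))]^2 \ge 0.25$. Hence 

$$
\begin{aligned}
R_{121} &\ge 0.25\sum_{j=L}^{j_0}\sum_{k=0}^{2^j-1} E\big[d_{jk}^2\, I(C_s\nu_j^{-1} < |d_{jk}| < \alpha_{jn}/\sqrt{n})\big] \\
&\ge 0.25\,C_s^2\sum_{j=L}^{j_0} 2^j\nu_j^{-2}\, I\big(\beta_{jn} \ge \nu_j[\sqrt{n}\,\eta(C_s\sqrt{n}\,\nu_j^{-1})]^{-1}\big).
\end{aligned}
\tag{5.65}
$$

Plug the value of $\nu_j$ into (5.65). The combination of (5.63)-(5.65) completes 
the proof. Note that in Theorem 1 the choice of $\beta_{jn}$ makes the inequality 
$\alpha_{jn} \ge C_s\sqrt{n}\,\nu_j^{-1}$ impossible. □ 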
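The elementary inequality $E(X+Y)^2 \ge 0.5EX^2 - EY^2$ invoked at the start of this proof is stated without derivation. One standard way to see it, given here only as a reader's aid, is to write $X = (X+Y) - Y$ and apply $(u - v)^2 \le 2u^2 + 2v^2$:

```latex
\begin{align*}
X^2 = \bigl((X+Y) - Y\bigr)^2 &\le 2(X+Y)^2 + 2Y^2 \\
\Longrightarrow\quad E(X+Y)^2 &\ge 0.5\,EX^2 - EY^2 .
\end{align*}
```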

Proof of Theorem 9. Plugging $\beta_{jn} = 2^{bj}$ into (5.62) we obtain that 

$$\beta_{jn} \ge \nu_j\big[\sqrt{n}\,\eta(C_s\sqrt{n}/\nu_j)\big]^{-1} \iff \nu_j \ge C_s\sqrt{n}\,\big[\eta^{-1}\big(2^{j(r+1/2-b)}n^{-1/2}\big)\big]^{-1}.$$

Recalling that $j \le j_0$, we note that $2^{j(r+1/2-b)}n^{-1/2} \le n^{-b/(2r+1)}$. Hence, by 
(5.62), 

Rn{Bl q {A)J) 
jo 

>CJ2 2~ 2rj I(Pjn > VjiVnritCsVn/vj)]' 1 ) 
j=L 

jo 
j=L 

= Cm- 2r/(2r+1) [?r 1 (n"' ,/(2r+1) )] 2r/(r+1/2) 

for some positive absolute constants $C$ and $C_1$, which proves (3.45). To finish 
the proof, consider various cases for $\eta(x)$. □ 
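The bound $2^{j(r+1/2-b)}n^{-1/2} \le n^{-b/(2r+1)}$ for $j \le j_0$ used in this proof can be verified directly. The computation below assumes that $2^{j_0} = n^{1/(2r+1)}$ (the usual choice of the low-resolution boundary, not restated in this part of the paper) and that $r + 1/2 - b \ge 0$, so that the left-hand side is nondecreasing in $j$:

```latex
\[
2^{j(r+1/2-b)}\,n^{-1/2}
  \le 2^{j_0(r+1/2-b)}\,n^{-1/2}
  = n^{(r+1/2-b)/(2r+1) - 1/2}
  = n^{-b/(2r+1)},
\]
```

since $(r + 1/2)/(2r+1) = 1/2$.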